Will ORDER BY on non-unique columns work with an index scan? - mysql

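+----+-----------+------------+--------------------+-----------+-----+-----+-----+
| id | market_id | date       | keyword            | sku       | a   | b   | c   |
+----+-----------+------------+--------------------+-----------+-----+-----+-----+
| 1  | 1         | 2019-01-01 | some text for this | QAB-XU-VV | 3.1 | 2.4 | 3.5 |
| 2  | 2         | 2019-01-02 | some text for text | ABC-XA-VV | 2.1 | 4.1 | 1.2 |
| 3  | 1         | 2019-01-03 | some text for XXX  | DDD-XA-RR | 2.7 | 3.5 | 4.1 |
+----+-----------+------------+--------------------+-----------+-----+-----+-----+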
I need to run a query like this:
SELECT
sku,
keyword,
SUM(a),
SUM(b),
SUM(c)
FROM A
WHERE market_id = 2 AND date BETWEEN '2020-01-01' and '2020-02-02'
GROUP BY sku, keyword
LIMIT 10
OFFSET XX
Assume the table is named SampleTable and has a composite index on SampleTable(sku, keyword), so this query does an index scan. The query is fast, but I need to add LIMIT ... OFFSET for pagination. What I need to know is whether the rows will come back sorted by id. I can't add ORDER BY id, because then the query does a filesort and becomes very slow, since it uses a full scan instead of an index scan. Here is what confuses me:
MySQL has two ways to produce ordered results: it can use a filesort,
or it can scan an index in order.
Ordering the results by the index works only when the index’s order is
exactly the same as the ORDER BY clause and all columns are sorted in
the same direction (ascending or descending).
The ORDER BY clause also has the same limitation as lookup queries: it
needs to form a leftmost prefix of the index. In all other cases,
MySQL uses a filesort.
Please help me understand where I'm wrong. Thank you.

WHERE, GROUP BY, and ORDER BY each want to scan the entire table.
One index may help.
In your query, INDEX(market_id, date) (or an index starting with those two columns) can avoid a full table scan and scan only part of the index. But that does nothing toward the GROUP BY or ORDER BY. Furthermore, since the WHERE clause is not all = tests, tacking on the GROUP BY columns is futile.
If you were able to get past the WHERE, then GROUP BY and ORDER BY lead to separate passes unless they are 'identical'.
I believe your query, as it stands, is doomed to need 4 passes:
1. Scan over the index to handle the filtering by WHERE.
2. Sort and "group" the resulting temp table.
3. Sort for ORDER BY.
4. Skip over OFFSET rows; only then can you deliver the 10 desired by LIMIT.
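A minimal sketch of the index and query discussed above, assuming the table is named SampleTable as in the question. The index name idx_market_date and the OFFSET value are made up for illustration. After grouping by sku and keyword a result row no longer has a single id, so ordering by the GROUP BY columns is one way to make the pagination deterministic, and it keeps GROUP BY and ORDER BY identical so that one sort can serve both:

```sql
-- Assumption: table name SampleTable, index name idx_market_date (hypothetical).
-- Lets the WHERE clause scan only part of the index instead of the whole table.
ALTER TABLE SampleTable ADD INDEX idx_market_date (market_id, date);

SELECT sku,
       keyword,
       SUM(a) AS sum_a,
       SUM(b) AS sum_b,
       SUM(c) AS sum_c
FROM SampleTable
WHERE market_id = 2
  AND date BETWEEN '2020-01-01' AND '2020-02-02'
GROUP BY sku, keyword
ORDER BY sku, keyword      -- same columns as GROUP BY: stable page boundaries
LIMIT 10 OFFSET 20;        -- OFFSET 20 is an example value (third page)
```

Running EXPLAIN on this statement should show whether the server still needs a temporary table or a filesort for the grouping step.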

Related

MySQL: Why 5th ID in the IN clause drastically changes query plan?

Given the following two queries:
Query #1
SELECT log.id
FROM log
WHERE user_id IN
(188858, 188886, 189854, 203623, 204072)
and type in (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;
Query #2 - 4 IDs instead of 5
SELECT log.id
FROM log
WHERE user_id IN
(188858, 188886, 189854, 203623)
and type in (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;
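Explain Plan
-- Query #1
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+
| id | select_type | table | type  | possible_keys          | key                    | key_len | rows  | Extra                                              |
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+
| 1  | SIMPLE      | log   | range | idx_user_id_and_log_id | idx_user_id_and_log_id | 4       | 41280 | Using index condition; Using where; Using filesort |
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+
-- Query #2
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+
| id | select_type | table | type  | possible_keys          | key                    | key_len | rows  | Extra                                              |
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+
| 1  | SIMPLE      | log   | index | idx_user_id_and_log_id | PRIMARY                | 4       | 53534 | Using where                                        |
+----+-------------+-------+-------+------------------------+------------------------+---------+-------+----------------------------------------------------+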
Why does the addition of a single ID make the execution plan so different? I'm talking about a difference in time between milliseconds and ~1 minute. I thought that it could be related to the eq_range_index_dive_limit parameter, but the number of IDs is below 10 anyway (the default). I know that I can force the usage of the index instead of the clustered index, but I wanted to know why MySQL decided that.
Should I try to understand that? Or sometimes it's not possible to understand query planner decisions?
Extra Details
Table Size: 11GB
Rows: 108 Million
MySQL: 5.6.7
Doesn't matter which ID is removed from the IN clause.
The index: idx_user_id_and_log_id(user_id, id)
As you have shown, MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:
Read all qualifying rows, sort them, and pick the n top rows.
Read the rows in sorted order and stop when n qualifying rows have been found.
In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straight-forward, especially for columns that are not indexed, or for columns where values are correlated. In your case, one probably has to read a lot more of the table in sorted order in order to find the first 25 qualifying rows than what the optimizer expected.
There have been several improvements in how LIMIT queries are handled, both in later releases of 5.6 (you are running on a pre-GA release!), and in newer releases (5.7, 8.0). I suggest you try to upgrade to a later release, and see if this still is an issue.
In general, if you want to understand query planner decisions, you should look at the optimizer trace for the query.
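As a rough sketch, the optimizer trace (available from MySQL 5.6) can be captured for one statement in the current session as follows. The statement traced here is Query #1 from the question; everything else is standard server syntax:

```sql
-- Turn tracing on for this session only.
SET optimizer_trace = 'enabled=on';

-- Run the statement whose plan you want to understand.
SELECT log.id
FROM log
WHERE user_id IN (188858, 188886, 189854, 203623, 204072)
  AND type IN (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;

-- Read the JSON trace of the last traced statement.
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;

-- Turn tracing off again.
SET optimizer_trace = 'enabled=off';
```

The trace lists the access paths the optimizer considered and their estimated costs. A long trace may be truncated unless optimizer_trace_max_mem_size is raised.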
JOIN is much more efficient.
Create a temporary table with the values of the IN operator.
Then make a JOIN between table 'log' to the temporary table of values.
Refer to this answer
for more info.
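A sketch of the temporary-table approach described above. The table name tmp_user_ids is made up, and the INT column type is an assumption (the key_len of 4 in the EXPLAIN output suggests a 4-byte integer). Whether this is actually faster than the IN list should be measured on the real data:

```sql
-- Hypothetical temporary table holding the values of the IN list.
CREATE TEMPORARY TABLE tmp_user_ids (
  user_id INT NOT NULL PRIMARY KEY
);

INSERT INTO tmp_user_ids (user_id)
VALUES (188858), (188886), (189854), (203623), (204072);

-- Same result set as Query #1, expressed as a join.
SELECT log.id
FROM log
JOIN tmp_user_ids ON tmp_user_ids.user_id = log.user_id
WHERE log.type IN (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;

DROP TEMPORARY TABLE tmp_user_ids;
```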
Add
INDEX(user_id, type, id),
INDEX(type, user_id, id)
Each of these is a "covering" index. As such, the entire query can be performed by looking only in one index, without touching the 'data'.
I am giving the Optimizer two choices; hopefully it will be able to work out whether user_id IN (...) or type IN (...) is more selective, and pick the better index accordingly.
If, after adding those, you don't have any use for idx_user_id_and_log_id(user_id, id), DROP it.
(No, I can't explain why query 2 chose to do a table scan.)
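Written out as DDL, the two suggested indexes might look like this. The index names are made up for illustration; on a table of this size, adding indexes can take a while:

```sql
-- Hypothetical index names; column lists are the ones suggested above.
ALTER TABLE log
  ADD INDEX idx_user_type_id (user_id, type, id),
  ADD INDEX idx_type_user_id (type, user_id, id);

-- Check which index the optimizer picks and whether Extra shows "Using index".
EXPLAIN
SELECT log.id
FROM log
WHERE user_id IN (188858, 188886, 189854, 203623, 204072)
  AND type IN (14, 15, 17)
ORDER BY log.id DESC
LIMIT 25 OFFSET 0;
```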

Query time suddenly increased

I have MariaDB 10.1.14. For a long time I have been running the following query without problems (it took about 3 seconds):
SELECT
sum(transaction_total) as sum_total,
count(*) as count_all,
transaction_currency
FROM
transactions
WHERE
DATE(transactions.created_at) = DATE(CURRENT_DATE)
AND transaction_type = 1
AND transaction_status = 2
GROUP BY
transaction_currency
Suddenly, and I'm not sure exactly why, this query takes about 13 seconds.
This is the EXPLAIN:
And these are all the indexes of the transactions table:
What is the reason for the sudden query time increase? and how can I decrease it?
If you are adding more data to your table the query time will increase.
But you can do a few things to improve the performance.
Create a composite index for ( transaction_type, transaction_status, created_at)
Remove the DATE() function (or any function) from your columns, because wrapping a column in a function prevents the engine from using the index. CURRENT_DATE is a constant, so wrapping it in DATE() does no harm, but it is unnecessary because CURRENT_DATE already returns a DATE.
If created_at isn't a DATE, you can use
created_at >= CURRENT_DATE and created_at < CURRENT_DATE + INTERVAL 1 DAY
or create a different field to only save the date part.
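Putting the two suggestions together, the original query could be rewritten roughly as follows. The index name idx_type_status_created is made up. Note the use of INTERVAL 1 DAY: a plain "+ 1" on a date is evaluated as a number and can produce an invalid date at the end of a month:

```sql
-- Hypothetical index name; columns as suggested above.
ALTER TABLE transactions
  ADD INDEX idx_type_status_created (transaction_type, transaction_status, created_at);

SELECT SUM(transaction_total) AS sum_total,
       COUNT(*)               AS count_all,
       transaction_currency
FROM transactions
WHERE created_at >= CURRENT_DATE                      -- no function on the column
  AND created_at <  CURRENT_DATE + INTERVAL 1 DAY     -- half-open range for "today"
  AND transaction_type = 1
  AND transaction_status = 2
GROUP BY transaction_currency;
```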
+1 to the answer from @JuanCarlosOropeza, but you can go a little further with the index.
ALTER TABLE transactions ADD INDEX (
transaction_type,
transaction_status,
created_at,
transaction_currency,
transaction_total
);
As @RickJames mentioned in comments, the order of columns is important.
First, columns in equality comparisons
Next, you can index one column that is used for a range comparison (which is anything besides equality), or GROUP BY or ORDER BY. You have both range comparison and GROUP BY, but you can only get the index to help with one of these.
Last, other columns needed for the query, if you think you can get a covering index.
I describe more detail about index design in my presentation How to Design Indexes, Really (video: https://www.youtube.com/watch?v=ELR7-RdU9XU).
You're probably stuck with the "using temporary" since you have a range condition and also a GROUP BY referencing different columns. But you can at least eliminate the "using filesort" by this trick:
...
GROUP BY
transaction_currency
ORDER BY NULL
Supposing that it's not important to you which order the rows of the query results return in.
I don't know what has made your query slower. More data? Fragmentation? New DB version?
However, I am surprised to see that there is no index really supporting the query. You should have a compound index starting with the column with highest cardinality (the date? well, you can try different column orders and see which index the DBMS picks for the query).
create index idx1 on transactions(created_at, transaction_type, transaction_status);
If created_at contains a time part, then you may want to create a computed column created_on only containing the date and index that instead.
You can even extend this index to a covering index (where clause fields followed by group by clause fields followed by select clause fields):
create index idx2 on transactions(created_at, transaction_type, transaction_status,
transaction_currency, transaction_total);
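The computed-column idea mentioned above could be sketched like this for MariaDB 10.1, assuming created_at is a DATETIME column. The names created_on and idx_created_on are made up. MariaDB 10.1 uses the PERSISTENT keyword for a stored computed column; MySQL 5.7 and later use STORED instead:

```sql
-- Assumption: created_at is DATETIME. created_on is a hypothetical new column.
ALTER TABLE transactions
  ADD COLUMN created_on DATE AS (DATE(created_at)) PERSISTENT;

-- Index the computed date together with the other filter columns.
CREATE INDEX idx_created_on
  ON transactions (created_on, transaction_type, transaction_status);

-- The date filter becomes a plain equality on an indexed column.
SELECT SUM(transaction_total) AS sum_total,
       COUNT(*)               AS count_all,
       transaction_currency
FROM transactions
WHERE created_on = CURRENT_DATE
  AND transaction_type = 1
  AND transaction_status = 2
GROUP BY transaction_currency;
```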

Using index with IN clause and ordering by primary key

I am having a problem with the following task using MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 - CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application needs to filter records for a specific enterprise, department and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status, but how quick will the sorting by primary key be? Will it take the recent 10 values for each status available, and then merge them, or will it first merge the ids for each status together, and only after that take the first ten (this way it's gonna be much slower I guess).
All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4 column index will be nearly the best. It will use the first 3 columns, but not be able to do the ORDER BY id, so it won't use id. And it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
"How quick will sorting by id be"? That depends on three things.
Whether the sort can be avoided (see above);
How many rows in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.
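For reference, the two indexes proposed in this answer could be created and checked roughly as follows. The index names are made up. Comparing the Extra column of the three EXPLAIN outputs should show where a filesort is still needed:

```sql
-- Hypothetical index names; column lists are the ones proposed above.
ALTER TABLE Records
  ADD INDEX idx_ent_dep_status_id (enterprise, department, status, id),
  ADD INDEX idx_ent_dep_id (enterprise, department, id);

-- First query: all equality tests, then ORDER BY id. No sort expected.
EXPLAIN SELECT * FROM Records
WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0,10;

-- Second query: status omitted. The 3-column index can serve WHERE and ORDER BY.
EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0,10;

-- Third query: the IN on status breaks the ordering by id. A filesort is likely.
EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21 AND status IN (0,1,2)
ORDER BY id DESC LIMIT 0,10;
```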

What is the best way to sort by columns in mysql and use index?

I have a table with 10 columns. Now I want to give the users an option to sort the data by any column they want. For example, suppose a combo box with 7 items, each of which is a column of the table; the user chooses one item and gets the data sorted by the chosen column.
Now what is the problem?
My table has 3M records, and if I sort the data with indexed column I have no problem but with a non index column it takes 3.5mins to sort!!!
What is the solution I am thinking about?
Add index to every column of table that is needed to be sort by! In my case I will have index on 8 columns!!!!
What is the problem of my solution?
Having a lot of index on columns may decrease the speed of INSERT/UPDATE queries! In my case the table is updated frequently (every second!!!!!)
What is your solution for this case?!
Read this for more details on optimization: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. Using an index for sorting often comes together with using an index to find rows; however, it can also be used just for sorting, for example if you're just using ORDER BY without any WHERE clauses on the table. In such a case you would see the "index" type in EXPLAIN, which corresponds to scanning (potentially) the complete table in index order. It is very important to understand under which conditions an index can be used to sort data together with restricting the number of rows.
Looking at the same index (A,B) things like ORDER BY A ; ORDER BY A,B ; ORDER BY A DESC, B DESC will be able to use full index for sorting (note MySQL may not select to use index for sort if you sort full table without a limit). However ORDER BY B or ORDER BY A, B DESC will not be able to use index because requested order does not line up with the order of data in BTREE. If you have both restriction and sorting things like this would work A=5 ORDER BY B ; A=5 ORDER BY B DESC; A>5 ORDER BY A ; A>5 ORDER BY A,B ; A>5 ORDER BY A DESC which again can be easily visualized as scanning a range in BTREE. Things like this however would not work A>5 ORDER BY B , A>5 ORDER BY A,B DESC or A IN (3,4) ORDER BY B – in these cases getting data in sorting form would require a bit more than simple range scan in the BTREE and MySQL decides to pass it on.
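The cases listed above can be tried out with a small sketch like the following. The table t and the index name idx_a_b are made up for illustration, and the index is a plain ascending one. The actual plan depends on the server version and on the data, so treat the comments as the expected outcome rather than a guarantee:

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE t (
  id INT NOT NULL PRIMARY KEY,
  A  INT NOT NULL,
  B  INT NOT NULL,
  INDEX idx_a_b (A, B)
);

-- Requested order lines up with the index: no "Using filesort" expected.
EXPLAIN SELECT * FROM t ORDER BY A, B LIMIT 10;
EXPLAIN SELECT * FROM t ORDER BY A DESC, B DESC LIMIT 10;
EXPLAIN SELECT * FROM t WHERE A = 5 ORDER BY B LIMIT 10;
EXPLAIN SELECT * FROM t WHERE A > 5 ORDER BY A, B LIMIT 10;

-- Requested order does not line up with the index: "Using filesort" expected.
EXPLAIN SELECT * FROM t ORDER BY B LIMIT 10;
EXPLAIN SELECT * FROM t ORDER BY A, B DESC LIMIT 10;
EXPLAIN SELECT * FROM t WHERE A > 5 ORDER BY B LIMIT 10;
EXPLAIN SELECT * FROM t WHERE A IN (3, 4) ORDER BY B LIMIT 10;
```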
Option #1: If you are limited to MySQL, there's no better option than creating 8 indexes for the possible order columns. Your inserts/updates are going to suffer for sure, but no real visitor will wait 3.5 minutes for a list to be sorted.
Tune #1: To make it a little faster you can create partial (prefix) indexes instead of standard indexes, which use much less space (I assume some of these columns are varchar); this means fewer writes and a smaller footprint in memory. You just need to check the entropy for each column with the substring and make sure you still have distinction over 90%.
For example with a query like this:
> select count(distinct(substring(COLUMN, 1, 5))) as part_5, count(distinct(substring(COLUMN, 1, 10))) as part_10, count(distinct(substring(COLUMN, 1, 20))) as part_20, count(distinct(COLUMN)) as sum from TABLE;
+--------+---------+---------+---------+
| part_5 | part_10 | part_20 | sum |
+--------+---------+---------+---------+
| 892183 | 1996053 | 1996058 | 1996058 |
+--------+---------+---------+---------+
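With numbers like the sample output above, a 10-character prefix keeps almost all of the distinct values, so a prefix index could be declared as in this sketch. The names my_table, my_column and idx_my_column_prefix are placeholders. One caveat worth checking before relying on this for sorting: the MySQL manual states that an index on only a prefix of a column cannot fully resolve an ORDER BY on that column, so a filesort may still appear in EXPLAIN:

```sql
-- Placeholder names; the prefix length 10 is taken from the part_10 column above.
ALTER TABLE my_table
  ADD INDEX idx_my_column_prefix (my_column(10));

-- Verify whether the sort is actually avoided before keeping the index.
EXPLAIN SELECT * FROM my_table ORDER BY my_column LIMIT 10;
```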
Tune #2: You can make your insert/update statements execute in the background. The application won't be faster, but the user experience is going to be much better.
Tune #3: Use bigger transactions if you can for the inserts/updates.
Option #2: You can try to use one of the search engines which have been built for this usage pattern (too). I would recommend Solr, as I have been using it for a while with great satisfaction, but I have heard good things about Elasticsearch as well.

Optimizing query instead of using order by

I want to run a simple query to get the "n" oldest records in the table. (It has a creation_date column).
How can I get that without using ORDER BY? It is a very big table, and using ORDER BY on the entire table to get only "n" records does not seem reasonable.
(Assume n << size of table)
When you are concerned about performance, you should probably not discard the use of order by too early.
Queries like that can be implemented as a Top-N query supported by an appropriate index. That runs very fast because it doesn't need to sort the entire table, not even the selected rows, because the data is already sorted in the index.
example:
select *
from table
where A = ?
order by creation_date
limit 10;
Without an appropriate index it will be slow if you have lots of data. However, if you create an index like this:
create index test on table (A, creation_date );
The query will be able to start fetching the rows in the correct order, without sorting, and stop when the limit is reached.
Recipe: put the where columns in the index, followed by the order by columns.
If there is no where clause, just put the order by into the index. The order by must match the index definition, especially if there are mixed asc/desc orders.
The indexed Top-N query is the performance king--make sure to use them.
A few links for further reading (all mine):
How to use index efficienty in mysql query
http://blog.fatalmind.com/2010/07/30/analytic-top-n-queries/ (Oracle centric)
http://Use-The-Index-Luke.com/ (not yet covering Top-N queries, but that's to come in 2011).
I haven't tested this concept before, but try creating an index on the creation_date column, which will automatically keep the rows sorted in ascending order. Then your select query can use ORDER BY creation_date DESC with LIMIT 20 to get the first 20 records. The database engine should realize the index has already done the sorting work and won't actually need to sort, because the index has already sorted the rows on save. All it needs to do is read the last 20 records from the index.
Worth a try.
Create an index on creation_date and query by using order by creation_date asc|desc limit n and the response will be very fast (in fact it cannot be faster). For the "latest n" scenario you need to use desc.
If you want more constraints on this query (e.g where state='LIVE') then the query may become very slow and you'll need to reconsider the indexing strategy.
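A minimal sketch of both situations, following the recipe "WHERE columns first, then ORDER BY columns" from the earlier answer. The table name events and the index names are made up; the state column and the 'LIVE' value come from the example above:

```sql
-- Hypothetical table "events" with columns state and creation_date.

-- No WHERE clause: an index on the ORDER BY column alone is enough.
CREATE INDEX idx_creation_date ON events (creation_date);

SELECT *
FROM events
ORDER BY creation_date ASC    -- oldest first; use DESC for the latest n
LIMIT 20;

-- With an equality filter: put the filter column first, then the sort column.
CREATE INDEX idx_state_creation_date ON events (state, creation_date);

SELECT *
FROM events
WHERE state = 'LIVE'
ORDER BY creation_date ASC
LIMIT 20;
```

In both cases the server can read the first 20 index entries in order and stop, instead of sorting all matching rows.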
You can use GROUP BY if you're grouping some data, and then a HAVING clause to select specific records.