MySQL: Group by query optimization

I've got a table of the following schema:
+----+--------+----------------------------+----------------------------+
| id | amount | created_timestamp | updated_timestamp |
+----+--------+----------------------------+----------------------------+
| 1 | 1.00 | 2018-01-09 12:42:38.973222 | 2018-01-09 12:42:38.973222 |
+----+--------+----------------------------+----------------------------+
Here, for id = 1, there could be multiple amount entries. I want to extract the last added entry and its corresponding amount, grouped by id.
I've written a working query with an inner join on the self table as below:
SELECT t1.id,
       t1.amount,
       t1.created_timestamp,
       t1.updated_timestamp
FROM transactions AS t1
INNER JOIN (SELECT id,
                   MAX(updated_timestamp) AS last_transaction_time
            FROM transactions
            GROUP BY id) AS latest_transactions
        ON latest_transactions.id = t1.id
       AND latest_transactions.last_transaction_time = t1.updated_timestamp;
I think the inner join is overkill and that this can be replaced with a more efficient query. I've written the following query with group by and having, but it isn't working. Can anyone help?
select id, any_value(`updated_timestamp`), any_value(amount) from transactions group by `id` having max(`updated_timestamp`);

There are two (good) options when performing a query like this in MySQL. You have already tried one option. Here is the other:
SELECT t1.id,
       t1.amount,
       t1.created_timestamp,
       t1.updated_timestamp
FROM transactions AS t1
LEFT OUTER JOIN transactions AS later_transactions
       ON later_transactions.id = t1.id
      AND later_transactions.updated_timestamp > t1.updated_timestamp
WHERE later_transactions.id IS NULL;
These methods are the ones in the documentation, and also the ones I use in my work basically every day. Which one is most efficient depends on a variety of factors, but usually, if one is slow the other will be fast.
Also, as Strawberry points out in the comments, you need a composite index on (id,updated_timestamp). Having separate indexes for id and updated_timestamp is not equivalent.
Why a composite index?
Be aware that an index is just a copy of the data that is in the table. In many respects, it works the same as a table does. So, creating an index is creating a copy of the table's data that the RDBMS can use to query the table's information in a more efficient manner.
An index on just updated_timestamp will create a copy of the data that contains updated_timestamp as the first column, and that data will be sorted. It will also include a hidden row ID value (that will work as a primary key) in each of those index rows, so that it can use that to look up the full rows in the actual table.
How does that help in this query (either version)? If we wanted just the latest (or earliest) updated_timestamp overall, it would help, since it can just check the first or last record in the index. But since we want the latest for each id, this index is useless.
What about just an index on id? Here we have a copy of the id column, sorted by the id column, with the row ID attached to each row in the index.
How does this help this query? It doesn't, because it doesn't even have the updated_timestamp column as part of the index, and so won't even consider using this index.
Now, consider a composite index: (id,updated_timestamp).
This creates a copy of the data with the id column first, sorted, and then the second column updated_timestamp is also included, and it is also sorted within each id.
This is the same way that a phone book (if people still use those things as something more than paperweights) is sorted by last name and then first name.
Because the rows are sorted in this way, MySQL can look, for each id, at just the last record of a given id. It knows that that record contains the highest updated_timestamp value, because of how the index is defined.
So, it only has to look up one row for each id that exists. That is fast. Further explanation into why would take up a lot more space, but you can research it yourself if you like, by just looking into B-Trees. Suffice to say, finding the first (or last) record is easy.
Try the following:
ALTER TABLE transactions
ADD INDEX `LatestTransaction` (`id`,`updated_timestamp`)
And then see whether your original query or my alternate query is faster. Likely both will be faster than having no index. As your table grows or your SELECT statement changes, which of these queries is faster may change too, but the index is going to provide the biggest performance boost, regardless of which version of the query you use.
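One way to check whether the composite index is actually being picked up is to run EXPLAIN on the grouped part of the query. This is only a minimal sketch: it assumes the LatestTransaction index from above has been created, and the plan you see depends on the MySQL version and on table statistics.

```sql
-- Assumes the LatestTransaction (id, updated_timestamp) index already exists.
-- With it in place, the per-id maximum can usually be read from the index alone.
EXPLAIN
SELECT id,
       MAX(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id;

-- Things to look for in the output (not guaranteed on every version):
--   key   = LatestTransaction
--   Extra = "Using index for group-by" (a loose index scan) or "Using index"
```

If the key column shows NULL instead, the optimizer is not using the index for this statement, and the full queries above are unlikely to benefit from it either.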

Related

MySQL super slow inner join with group by

I'm having a problem joining the 2 tables below. What I need is all of the parts in the first table where the clei OR part number is found in the second table, with a count of how many matches there are from table 1.
=================== ===================
table: svi          table: svp
=================== ===================
id                  id
po                  price
customer            clei
clei                partNumber
partNumber          description
=================== ===================
svi has about 1 million rows. svp has about 2000. Here is the join that I'm using...
SELECT svi.clei,
svi.partNumber,
count(*)
FROM svp svp
INNER JOIN
svi svi
ON (svp.clei = svi.clei)
OR (svp.partNumber = svi.partNumber)
GROUP BY svi.partNumber
The query is taking a little over 2 minutes to run, which seems ridiculously slow. clei and partNumber are indexed in both tables. What else can I do to speed up this join?

The indexes don't help very much here because there are no WHERE conditions against constants and because of the OR operator.
All the 2000 rows of the svp table are read; conditions against constants reduce the number of rows read from a table but there is no such condition here.
Then, for each of these 2000 rows, one or two lookups are performed in the indexes of the svi table to identify the matching rows: one for clei and, if it doesn't succeed, another one for partNumber, or vice versa.
A compound index on columns clei and partNumber on table svi doesn't help here; it helps only when the conditions are combined using AND.
The indexes on table svp are not used. If there is an index on svp that contains both clei and partNumber columns then MySQL can decide to read it here just because it contains less data than the entire table. But it still reads the entire index and processes all the rows. It cannot use the index to filter rows because there is no filtering on svp.
It could be worse (read the entire svi table and use the indexes on svp for lookup) but MySQL is smart enough to process the smaller table first.
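A rewrite that is often tried when an OR in the join condition gets in the way of the indexes is to split the join into two single-condition joins and combine them with UNION. This is not part of the analysis above, just a hedged sketch. It assumes id is unique in both tables, so that UNION can drop pairs matched by both conditions, and whether it is actually faster has to be measured.

```sql
-- Each branch joins on a single equality, so the existing single-column
-- indexes on svi.clei and svi.partNumber can each be used for lookups.
-- UNION (not UNION ALL) removes (svi row, svp row) pairs found by both branches.
SELECT m.partNumber,
       COUNT(*) AS match_count
FROM (
    SELECT svi.id AS svi_id, svp.id AS svp_id, svi.partNumber
    FROM svp
    INNER JOIN svi ON svp.clei = svi.clei
    UNION
    SELECT svi.id AS svi_id, svp.id AS svp_id, svi.partNumber
    FROM svp
    INNER JOIN svi ON svp.partNumber = svi.partNumber
) AS m
GROUP BY m.partNumber;
```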
Put EXPLAIN in front of your query and MySQL tells you (in fewer words) what I tried to explain above.
As I also said in a comment, the query is invalid SQL. For one value of svi.partNumber you probably have more than one value for svi.clei. The GROUP BY svi.partNumber clause generates a single output row from all the rows it gets from table svi that have the same value for partNumber.
But, since there are two or more different values for clei for the same partNumber, the final value it picks for the expression svi.clei from the SELECT clause is indeterminate. This means it can change if you run the same query again later or if you run it on a different server that mirrors the database (or after the database is backed up then restored from the backup).
If you just forgot to add svi.clei in the GROUP BY clause then it's an easy fix but otherwise you have to re-think your query because as it is now, it doesn't produce the results you expect.
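If the missing svi.clei in the GROUP BY clause was indeed an oversight, the fixed query might look like the sketch below. The match_count alias is added for illustration; the join itself is unchanged, so this fixes the indeterminate result but not the speed.

```sql
-- Every non-aggregated column in the SELECT list now appears in GROUP BY,
-- so each output row has a well-defined clei and partNumber.
SELECT svi.clei,
       svi.partNumber,
       COUNT(*) AS match_count
FROM svp
INNER JOIN svi
        ON svp.clei = svi.clei
        OR svp.partNumber = svi.partNumber
GROUP BY svi.clei, svi.partNumber;
```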

Using index with IN clause and ordering by primary key

I am having a problem with the following task using MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 - CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application needs to filter on a concrete enterprise, department, and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status, but how quick will the sorting by primary key be? Will it take the recent 10 values for each status available, and then merge them, or will it first merge the ids for each status together, and only after that take the first ten (this way it's gonna be much slower I guess).

All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
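As a rough sketch, the two indexes discussed so far could be created and checked as follows. The index names are made up for illustration, and the EXPLAIN output depends on the MySQL version and the data.

```sql
-- Index names are illustrative, not from the original post.
ALTER TABLE Records
  ADD INDEX idx_ent_dept_status_id (enterprise, department, status, id),
  ADD INDEX idx_ent_dept_id (enterprise, department, id);

-- First query: ideally no "Using filesort" in the Extra column.
EXPLAIN
SELECT * FROM Records
WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0, 10;

-- Second query (status omitted): should be served by the 3-column index.
EXPLAIN
SELECT * FROM Records
WHERE enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0, 10;
```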
The third query gets dicier because of the IN. Still, my 4 column index will be nearly the best. It will use the first 3 columns, but not be able to do the ORDER BY id, so it won't use id. And it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
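The idea raised in the question, taking the latest 10 rows for each status and then merging them, can also be written out by hand with UNION ALL. This is a swapped-in alternative rather than something recommended above. It assumes the 4-column index (enterprise, department, status, id) exists, so that each branch can stop after reading 10 index entries.

```sql
-- Each parenthesized branch has an equality on status, so it can walk the
-- (enterprise, department, status, id) index backwards and stop at 10 rows.
(SELECT * FROM Records
 WHERE enterprise = 11 AND department = 21 AND status = 0
 ORDER BY id DESC LIMIT 10)
UNION ALL
(SELECT * FROM Records
 WHERE enterprise = 11 AND department = 21 AND status = 1
 ORDER BY id DESC LIMIT 10)
UNION ALL
(SELECT * FROM Records
 WHERE enterprise = 11 AND department = 21 AND status = 2
 ORDER BY id DESC LIMIT 10)
-- At most 30 rows reach this final sort.
ORDER BY id DESC
LIMIT 10;
```

This is more verbose than the IN version and has to be kept in sync with the list of status values, so it is probably only worth it if the IN query turns out to be slow in practice.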
See my Index Cookbook.
"How quick will sorting by id be"? That depends on three things.
Whether the sort can be avoided (see above);
How many rows in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.

If your table has more selects than inserts, are indexes always beneficial?

I have a mysql innodb table where I'm performing a lot of selects using different columns. I thought that adding an index on each of those fields could help performance, but after reading a bit on indexes I'm not sure if adding an index on a column you select on always helps.
I have far more selects than inserts/updates happening in my case.
My table 'students' looks like:
id | student_name | nickname | team | time_joined_school | honor_roll
and I have the following queries:
# The team column is varchar(32), and only has about 20 different values.
# The honor_roll field is a smallint and is only either 0 or 1.
1. select from students where team = '?' and honor_roll = ?;
# The student_name field is varchar(32).
2. select from students where student_name = '?';
# The nickname field is varchar(64).
3. select from students where nickname like '%?%';
all the results are ordered by time_joined_school, which is a bigint(20).
So I was just going to add an index on each of the columns, does that make sense in this scenario?
Thanks

Indexes help the database more efficiently find the data you're looking for. Which is to say you don't need an index simply because you're selecting a given column, but instead you (generally) need an index for columns you're selecting based on - i.e. using a WHERE clause (even if you don't end up including the searched column in your result).
Broadly, this means you should have indexes on columns that segregate your data in logical ways, and not on extraneous, simply informative columns. Before looking at your specific queries, all of these columns seem like reasonable candidates for indexing, since you could reasonably construct queries around these columns. Examples of columns that would make less sense would be things like phone_number, address, or student_notes - you could index such columns, but generally you don't need or want to.
Specifically based on your queries, you'll want student_name, team, and honor_roll to be indexed, since you're defining WHERE conditions based on the values of these columns. You'll also benefit from indexing time_joined_school if, as you suggest, you're ORDER BYing your queries based on that column. Your LIKE query is not actually easy for most RDBs to handle, and indexing nickname won't help. Check out How to speed up SELECT .. LIKE queries in MySQL on multiple columns? for more.
Note also that the ratio of SELECT to INSERT is not terribly relevant for deciding whether to use an index or not. Even if you only populate the table once, and it's read-only from that point on, SELECTs will run faster if you index the correct columns.

Yes, indexes help accelerate your queries.
In your case you should have index on:
1) Team and honor_roll from query 1 (only 1 index with 2 fields)
2) student_name
3) time_joined_school from order
For query 3, an index can't be used because the LIKE pattern starts with a wildcard. Hope this helps.
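The indexes listed above could be created along these lines. This is a sketch only: the index names and the sample value 'red' are invented for illustration, and SELECT * stands in for the column list the question leaves out.

```sql
-- Index names are illustrative, not from the original post.
CREATE INDEX idx_team_honor_roll    ON students (team, honor_roll);
CREATE INDEX idx_student_name       ON students (student_name);
CREATE INDEX idx_time_joined_school ON students (time_joined_school);

-- Check which index query 1 ends up using ('red' is a made-up team value).
EXPLAIN
SELECT * FROM students
WHERE team = 'red' AND honor_roll = 1
ORDER BY time_joined_school;
```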

optimizing particular query mysql

So I've been searching for a solution and reading books, and haven't been able to figure it out. The question is rather simple: I have 2 tables. On one table I have 2 fields:
table_1:"chromosome" and "position" both of them being integers.
table_2:"chromosome" "start" and "end", all being integers as well.
I want a query that gives me back all rows from table_1 that are between the start and end of table_2. The query looks like this:
SELECT
table_1 . *
FROM
table_1,
table_2
WHERE
table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_2.end;
So this query works fine, but my tables have many millions of rows (7092713 and 215909 respectively). I indexed (chromosome, position) and (chromosome, start, end). The weird part is that if I run the query one row at a time (Perl DBI, one statement for every row of table_2), it runs a lot faster. Not sure where I am screwing up.
Any help would be appreciated.
Jorge Kageyama

For the sake of clarity, let's start by recasting your query using the standard JOIN syntax. The query is equivalent but easier to read.
SELECT table_1.*
FROM table_1
JOIN table_2 ON (table_1.chromosome = table_2.chromosome
             AND table_1.position > table_2.start
             AND table_1.position < table_2.end)
Second, it's smart when searching large tables (or any tables for that matter) to avoid * in your SELECT clauses. Using * denies useful data to the optimizer about what you do, or don't, need in your result set. So let us say
SELECT table_1.chromosome, table_1.position
for SELECT.
So, it becomes clear that your result set, and your join, need chromosome and position, and nothing else, from your larger table. Try creating a compound BTREE index on that table, as follows (the index names here are arbitrary).
CREATE INDEX idx_chromosome_position ON table_1 (chromosome, `position`) USING BTREE
Similarly, try creating an index on table_2 as follows.
CREATE INDEX idx_chromosome_start_end ON table_2 (chromosome, `start`, `end`) USING BTREE
These are called covering indexes. They contain enough columns that the query can be satisfied from the index without having to bounce back to the original table.
BTREE indexes (the default by the way) are inherently ordered. Appropriate records in table_1 can be found by range scans on the index starting with (chromosome,start) and ending with (chromosome,end).
Third, it's possible you're getting a massive combinatorial explosion of rows from table_1 in your result set. You'll get a row for every combination of rows in the two tables that matches your ON() clause. It's hard to know whether that's the case without knowing a lot about your data.
You could try to reduce that combinatorial explosion using
SELECT DISTINCT table_1.chromosome, table_1.position
Give this a try. If you're still not getting anywhere, maybe another question with complete table definitions and the results of EXPLAIN will be helpful.
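To see whether the compound indexes are picked up, the rewritten query can be prefixed with EXPLAIN. The sketch below assumes that the upper bound was meant to be table_2.end (table_1 has no end column) and quotes position, start and end with backticks purely as a precaution.

```sql
EXPLAIN
SELECT DISTINCT table_1.chromosome, table_1.`position`
FROM table_1
JOIN table_2
  ON table_1.chromosome = table_2.chromosome
 AND table_1.`position` > table_2.`start`
 AND table_1.`position` < table_2.`end`;

-- In the output, "Using index" in the Extra column for a table means the
-- query is being answered from the index without touching the table rows.
-- The rows column gives a rough estimate of how much work each step does.
```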

Interesting question. Without knowing more about the quantities contained in "position," I would still approach it generally in this way:
Select for position generally from table_1 (with about 7 million rows) so that the resulting table is a bin of a smaller amount of data. Let's say, for instance, that the "position" quantity is a set of discrete integers from 2-9. Select from table_1 where position is equal to 2, then select from table_2 where "start" is less than 2 and "end" is greater than 2. Iterate over this query selection 8 times updating a new table_3 with results.
I am assuming here that table_2 is unique on chromosome, and table_1 is not. Therefore, you end up with chromosomes that could have multiple positions within the same range (a chromosome has one range, but can appear anywhere within that range). You also, then, can't tell how large the resulting join table is going to be, but it could be quite large as each of the 7 million rows in table_1 could be within all ranges in table_2.
Iterating would allow you to "grow" your results while observing the quality at each point experimentally before committing to the entire loop.
Here is an idea of the query I have in mind (untested):
SELECT t1.chromosome, t1.position, t2.start, t2.end
FROM
  (SELECT table_1.chromosome, table_1.position
   FROM table_1 WHERE table_1.position = 2) AS t1
JOIN
  (SELECT table_2.chromosome, table_2.start, table_2.end
   FROM table_2 WHERE table_2.start < 2 AND table_2.end > 2) AS t2
ON
  t1.chromosome = t2.chromosome
Good luck, and I hope you find your answer!

How to prevent MySQL selecting one index when a better one is available?

I have a table with 30,000 rows (and growing), which I join with another table. On some pages, I need to run some 100+ of those queries, and things get slow. If I EXPLAIN the query, I notice that one table uses a primary key and is fast, but the other table uses one of its indexes, which is not the best one. Here's an overview:
SIMPLE | acc_entries | ref | ledger,date,type,status,status_ledger_date_type | type | 1 | const | 15359 | Using where
This is a sample query:
SELECT SUM(usd) AS total FROM acc_entries
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'
As you can see, I am using in my WHERE the fields status, ledger (which is the field that joins with acc_ledgers.account), date and type. All of these fields have indexes. However, there is also a specific index that covers all of them, in that same order. It is called status_ledger_date_type, and as you can see it is one of the indexes that MySQL considers using. However, in the end MySQL opts to use type as an index. This has some 15,000 possible rows (half of the table), whereas the combined index would match only a fraction of this. So my question is: why does MySQL select this index when a better one is available, and how can I prevent this?

You can try using index hints to force the use of your desired index.
MySql docs on Index Hints
The Battle Between Force Index and the Query Optimizer
7 ways to convince MySQL to use the right index
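As a sketch of what an index hint could look like for the sample query, using the index name as it appears in the EXPLAIN output:

```sql
-- FORCE INDEX goes right after the table it applies to.
SELECT SUM(usd) AS total
FROM acc_entries FORCE INDEX (status_ledger_date_type)
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1
  AND acc_ledgers.account = 3004
  AND date >= '2011-01-01'
  AND date <= '2011-08-30'
  AND type = 'credit';
```

USE INDEX (status_ledger_date_type) is a softer variant that still lets the optimizer fall back to a table scan, and IGNORE INDEX (type) only rules out the index it currently prefers. It is worth comparing timings, since a forced index is not automatically the faster one.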

Actually, you want your index based on your smaller granularity. The Ledger from your Acc_Entries table will join to your ACC_Ledgers table on ITS primary index of ID, so the Acc_Ledgers is not really utilizing the Ledger portion for the WHERE clause. Your index should match as closely to the WHERE clause of your common queries. In this case, I would have an index on
(Account, Status, Type, Date)
The reason for Account being first: it gives a smaller result set. You could have 5,000 entries. Of those, 300 entries are for the one account, so you've already eliminated a huge amount of data to go through. Then, the Status... of the 300, you could have 100 with status 1, 100 with status 2, 100 with status 3, so you've now reduced the set even more, and so on with the other criteria of type and date.
Your query otherwise is completely fine... just a personal style in writing, I try to write my queries with the WHERE conditions as closely matching the index in same sequence too, so I would just have the Account clause first, then Status, Type and Date... but again, that's a personal style in writing queries.
