MySQL query suggestions - column vs join

We have a user table like this - with over 20 million entries.
CREATE TABLE `users` (
`uid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(64) default '',
`email` varchar(64) default '',
`flag` int(10) unsigned DEFAULT '0',
PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
In our admin panel, we'd like to show a few pinned users and search results from the user table.
There are two approaches we thought of to show pinned users (please suggest any other better approaches):
1) Add a separate column in the users table for pinned users. However, pinned users are a handful (fewer than 100) compared to the total number of users (> 20M), so this approach doesn't appear promising.
2) Keep a separate table of pinned users and use a join:
CREATE TABLE `pinnedusers` (
`uid` int(10) unsigned NOT NULL default 0,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and run a join, for example,
select *
from users
left join pinnedusers
on pinnedusers.uid=users.uid
order by pinnedusers.uid desc
limit 200;
However, we are worried about the performance of the second approach as it involves join, order, limit.
What do you suggest?

This should produce the results you want, and indicate which rows are "pinned" users.
SELECT
a.*,
IF(b.`uid` IS NULL,0,1) as `is_pinned`
FROM users a
LEFT JOIN pinnedusers b
on b.uid = a.uid
ORDER BY IF(b.`uid` IS NULL,0,1) DESC, a.uid DESC
LIMIT 200;

It's not the JOIN to fear, it's the need to scan the entire 20M rows.
To "show a few pinned users", use this to "show up to 200 pinned users":
SELECT u.*
FROM pinnedusers p
JOIN users u USING(uid)
ORDER BY ...
LIMIT 200;
If there are fewer than 200 pinned users, the list will have fewer than 200 entries. But the query will be fast.
Your original query and Sloan's are both terribly slow because they have to scan the entire 20M rows (assuming fewer than 200 are pinned).
If you want all the pinned users, plus enough non-pinned users to fill out a list of 200, that is a different task. But, which non-pinned users would you like? The first few? That would be quick. A random few? That is more complex, else it, too, would scan the entire 20M. Some other criteria?
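One possible shape for that "pinned plus first few non-pinned" list is sketched below (not from the original answer). It assumes "the first few" non-pinned users means those with the highest uid and fills the remainder after the pinned rows; both branches should stay cheap because neither needs a full scan:
(SELECT u.*, 1 AS is_pinned
 FROM pinnedusers p
 JOIN users u ON u.uid = p.uid)
UNION ALL
(SELECT u.*, 0 AS is_pinned
 FROM users u
 LEFT JOIN pinnedusers p ON p.uid = u.uid
 WHERE p.uid IS NULL       -- non-pinned only
 ORDER BY u.uid DESC       -- "the first few": newest uids first
 LIMIT 200)
ORDER BY is_pinned DESC, uid DESC
LIMIT 200;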


Understanding why group by query slows down when there are lots of text columns

I have a query which runs slowly. I've come up with a much faster alternative, but I'd like some help understanding why the original query is so slow.
A simplified version of my problem uses two tables. A simplified version of the first table, called profiles, is
`profiles` (
`id` int(11),
`title` char(255),
`body` text,
`pin` int(11),
PRIMARY KEY (`id`),
UNIQUE KEY `pin` (`pin`)
)
The simplified version of my second table, calls, is
`calls` (
`id` int(11),
`pin` int(11),
`duration` int(11),
PRIMARY KEY (`id`),
KEY `ivr_id` (`pin`)
)
My query is supposed to get the full profiles, with the addition of the number of calls received by each profile. The query I was using was:
SELECT profiles.*, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls
ON profiles.pin = calls.pin
GROUP BY profiles.pin
With ~100 profiles and ~250,000 calls, this query takes about 10 seconds which is slow.
If I modify the query to just select the title from profiles, not all columns, the query is much faster. If I modify the query to remove the group by, it's also much faster. If I just select everything from the profiles table, then it's also a fast query.
My actual profile table has many more text and char fields. The query speed is worse the more text fields that are selected. Why are the text fields causing the query to be so slow, when they are not involved in the JOIN or the GROUP?
I came up with a slightly different query which is much faster, less than half a second. This query is:
SELECT profiles.*, temp.readings
FROM profiles
LEFT JOIN (
SELECT pin ,COUNT(*) AS readings
FROM calls
GROUP BY pin
) AS temp
ON temp.pin=profiles.pin
Whilst I think I've solved my speed problem, I'd like to understand what is causing the issue in the first query.
======== Update ========
I've just profiled both queries and the entire speed difference is in the 'sending data' section. The slow query takes about 10 seconds and the faster query about 0.1 seconds.
======== Update 2 ========
After discussing with #scaisEdge, I think I can rephrase my question. Consider a table T1 that has ~40 columns of which 8 are of type TEXT and ~100 rows and table T2 which has 5 columns of type INT and VARCHAR with ~250,000 rows. Why is it that:
SELECT T1.* FROM T1 is fast
SELECT T1.* FROM T1 JOIN T2 GROUP BY T1.joinfield is slow
SELECT T1.selectfield FROM T1 JOIN T2 GROUP BY T1.joinfield is fast if selectfield is an INT or VARCHAR
This should happen because:
The first query joins 100 profiles with 250,000 calls and then reduces the returned rows by grouping the result. Selecting profiles.* implies full access to the profiles table data for each matching row.
The second query joins 100 profiles with the rows returned by the temp subquery (probably far fewer than 250,000), reducing the number of accesses to the profiles table data.
Instead of profiles.*, try accessing only the pin column:
SELECT profiles.pin, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls ON profiles.pin = calls.pin
GROUP BY profiles.pin
As a suggestion, note that the GROUP BY in the first query is allowed only in MySQL versions before 5.7: by default, newer versions reject a query that selects a column which is neither aggregated nor listed in the GROUP BY clause, and produce an error.
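For reference, this behaviour is governed by the ONLY_FULL_GROUP_BY flag in sql_mode (part of the default from MySQL 5.7 onwards); a quick, generic way to check which mode a server runs with:
SELECT @@sql_mode;
-- If ONLY_FULL_GROUP_BY appears in the result, selecting non-aggregated columns
-- that are not in (or functionally dependent on) the GROUP BY raises an error.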

Performance with MySQL query to load recent chat messages

I'm having some issues with performance for a MySql query for a chat application I'm in the process of building.
I'm trying to grab the most recent messages from a conversation. I'm testing with a table with approx 3 million rows in it (an export from an older version of the application). When loading from some conversations, it's quick. When loading from others, the query takes significantly longer.
Here's details on the table setup, it's an InnoDB table:
Column Type Comment
id int(10) unsigned Auto Increment
from int(10) unsigned NULL
to int(10) unsigned NULL
date int(10) unsigned NULL
message text NULL
read tinyint(1) NULL [0]
And here are the indexes I have:
PRIMARY id
INDEX from
INDEX to
INDEX date
This is an example of the current query that I'm running:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 342)
OR
(`to` = 2 and `from` = 342)
ORDER BY `id` DESC
LIMIT 10
Now, when I run this query with this user combination (which only has a total of 325 rows in the database), it takes 1.5+ seconds.
However, if I use a different user combination which has a total of 12,000 rows in the database, like this:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 10153)
OR
(`to` = 2 and `from` = 10153)
ORDER BY `id` DESC
LIMIT 10
Then the query runs in approximately 35-40 ms. Quite a big difference, and the opposite of what I would expect.
I'm sure I'm missing something here and would appreciate any help pointing me in the right direction for optimizing all of this.
It's not about how many records the user has. You have created one table for all chats, which is an issue when you try to fetch the first 10 records: users who have inserted entries recently will be served faster.
Another thing you can try: rather than using OR, use UNION, which may give a little advantage.
Try to use this:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 342)
UNION
SELECT *
FROM `chat`
WHERE
(`to` = 2 and `from` = 342)
ORDER BY `id` DESC
LIMIT 10
The time taken by the query in your case will also depend on how long ago a given user last messaged.
To address that, you would have to change your model and not keep all messages in a single table.
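If everything stays in the single chat table, a variation of that UNION worth testing (a sketch, not verified against this data) gives each branch its own ORDER BY/LIMIT so it can stop early, backed by composite indexes that cover both the filter and the sort; the index names below are illustrative:
ALTER TABLE `chat`
  ADD INDEX `from_to_id` (`from`, `to`, `id`),
  ADD INDEX `to_from_id` (`to`, `from`, `id`);

(SELECT * FROM `chat` WHERE `from` = 2 AND `to` = 342 ORDER BY `id` DESC LIMIT 10)
UNION ALL
(SELECT * FROM `chat` WHERE `to` = 2 AND `from` = 342 ORDER BY `id` DESC LIMIT 10)
ORDER BY `id` DESC
LIMIT 10;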

Can the performance of this query be improved?

I have a very slow query on a MySQL server.
Here is the query:
SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT, TERM, START_DT, END_DT, CTYPE, MW AS MW_AWARD,
Mark, SCID
FROM
( SELECT a.CRR_DT, a.TOU, a.SRCE, a.SINK, a.NAME, a.SEASON, a.SRCESUMCONG,
a.SINKSUMCONG, a.SRCEAVGCONG, a.SINKAVGCONG, a.SUMSINKMSRCE,
a.AVGSINKMSRCE, a.HOURCOUNT, b.TERM, b.CTYPE, b.START_DT,
b.END_DT, b.MW, b.SCID, b.Mark
FROM
( SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01'
) a
INNER JOIN
( SELECT MARKET, TERM, TOU, SRCE, SINK, NAME, SCID, CTYPE, START_DT,
END_DT, SUM(MW) AS MW, SUBSTR(MARKET, 1, 3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL' , 'LDES')
GROUP BY MARKET , TOU , SRCE , SINK , NAME , SCID , CTYPE ,
START_DT , END_DT
) b ON a.NAME = b.NAME
AND a.TOU = b.TOU
) c
WHERE c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7 )
ORDER BY NAME , CRR_DT , TOU ASC
Here is the EXPLAIN plan generated using MySQL Workbench.
I guess that the red blocks are dangerous. Can someone please help me understand this plan? A few hints on what I should check once I have this execution plan would help.
Edit: added the table layouts.
CREATE TABLE `CRR_CONGCALC` (
`CRR_DT` varchar(7) NOT NULL,
`TOU` varchar(50) NOT NULL,
`SRCE` varchar(50) NOT NULL,
`SINK` varchar(50) NOT NULL,
`SRCESUMCONG` decimal(12,6) DEFAULT NULL,
`SINKSUMCONG` decimal(12,6) DEFAULT NULL,
`SRCEAVGCONG` decimal(12,6) DEFAULT NULL,
`SINKAVGCONG` decimal(12,6) DEFAULT NULL,
`SUMSINKMSRCE` decimal(12,6) DEFAULT NULL,
`AVGSINKMSRCE` decimal(12,6) DEFAULT NULL,
`HOURCOUNT` int(11) NOT NULL DEFAULT '0',
`SEASON` char(1) NOT NULL DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`CRR_DT`,`SRCE`,`SINK`,`TOU`,`HOURCOUNT`),
KEY `srce_index` (`SRCE`),
KEY `srcesink` (`SRCE`,`SINK`)
)
CREATE TABLE `CRR_INVENTORY` (
`MARKET` varchar(50) NOT NULL,
`TERM` varchar(50) NOT NULL,
`TOU` varchar(50) NOT NULL,
`INVENTORY_DT` date NOT NULL,
`START_DT` datetime NOT NULL,
`END_DT` datetime NOT NULL,
`CRR_ID` varchar(50) NOT NULL,
`NSR_INDEX` tinyint(1) NOT NULL,
`SEGMENT` tinyint(1) NOT NULL,
`CTYPE` varchar(50) NOT NULL,
`CATEGORY` varchar(50) NOT NULL,
`COPTION` varchar(50) NOT NULL,
`SRCE` varchar(50) DEFAULT NULL,
`SINK` varchar(50) DEFAULT NULL,
`MW` decimal(8,4) NOT NULL,
`SCID` varchar(50) NOT NULL,
`SEASON` char(1) DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`MARKET`,`INVENTORY_DT`,`CRR_ID`),
KEY `srcesink` (`SRCE`,`SINK`)
)
Brings back memories. With a database, a "Full Table Scan" means that there is nothing that the database can use to speed up the query, it reads the entire table. The rows are stored in a non-sorted order, so there is no better way to "search" for the employee id you are looking for.
This is bad. Why?
If you have a table with a bunch of columns:
first_name, last_name, employee_id, ..., column50 and do a search where employee_id = 1234, if you don't have an index on the employee_id column, you're doing a sequential scan. Even worse if you're doing a join table2 on table1.employee_id = table2.eid, because it has to match the employee_id to every record in the join table.
If you create an index, you greatly reduce the scan time to find the matches (or throw away the non-matches) because instead of doing a sequential scan you can search a sorted field. Much faster.
When you create an index on the employee_id field, you are creating a way to search for employee numbers that is much, much, much faster. When you create an index, you are saying "I am going to join based on this field or have a where clause based on this field". This speeds up your query at the cost of a little bit of disk space.
There are all kinds of tricks with indexes, you can create them so they are unique, not unique, composite (contain multiple columns) and all kinds of stuff. Post your query and we can tell you what you might look at indexing to speed this up.
A good rule of thumb is that you should create an index on your tables on fields that you use in a where clause, join criteria or order by. Picking the field depends on a few things that are beyond the scope of this discussion, but that should be a start.
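As a concrete illustration of that rule of thumb (the table and column names are the hypothetical ones from the example above, not from the question):
-- Index the columns used in the WHERE clause and the join criteria.
CREATE INDEX idx_employee_id ON table1 (employee_id);
CREATE INDEX idx_eid ON table2 (eid);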
The pattern FROM ( SELECT... ) JOIN ( SELECT... ) ON ... does not optimize well. See if you can go directly from one of the tables, not hide it in a subquery.
CRR_CONGCALC needs INDEX(CRR_DT). (Please provide SHOW CREATE TABLE.)
CRR_INVENTORY needs INDEX(COPTION, START_DT).
Please make those changes, then come back for more advice, if needed.
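For reference, assuming the table definitions posted in the question, those two indexes could be added like this (the index names are arbitrary):
ALTER TABLE CRR_CONGCALC  ADD INDEX idx_crr_dt (CRR_DT);
ALTER TABLE CRR_INVENTORY ADD INDEX idx_coption_start (COPTION, START_DT);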
According to your explain diagram, there are full table scans happening at each sub-query on CRR_CONGCALC and CRR_INVENTORY. Then when you join the sub-queries together, another full table scan, and finally, when the result set is ordered, one more full table scan.
A few tips to improve performance:
Use fields that are indexed as part of your join statement, where clause, group by clause & order by clause. If this query is used often, consider adding indexes to all relevant columns.
Avoid nested sub-queries with aggregate operations in joins as much as possible. The result sets returned by the sub-queries are not indexed, so joining on them ends up scanning the whole result rather than just an index. The joins in this query could also result in weird and hard-to-detect fanning-out issues, but that isn't the performance issue you're seeking a solution for.
Filter the result set as early as possible (i.e. in all the sub-queries at the innermost layer) to minimize the number of rows the database server has to subsequently deal with.
Unless the final order by is necessary, avoid it.
Use temporary (or materialized) tables to de-nest subqueries. On these tables, you can add indexes, so further joining will be efficient. This assumes that you have the permissions to create & drop tables on the server
That said, here's how I would refactor your query.
In generating the inner query b, the group by clause does not contain all fields that are not aggregate columns. This is non-standard SQL, which can produce indeterminate results. MySQL allows it, and for the love of god I don't know why. It is best to avoid this trap.
The final wrapping query is unnecessary, as the where clause and group by clause can be applied to the unwrapped query.
This where clause seems fishy to me:
c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7)
START_DT & END_DT are datetime or timestamp columns being implicitly cast as char. It would be better to extract the year-month using the function DATE_FORMAT as:
DATE_FORMAT(<FIELD>, '%Y-%m-01')
Even if the where clause you used worked, it would omit records for which END_DT and CRR_DT fall in the same month. I'm not sure if that is the desired behaviour, but here's a query to illustrate what your boolean expression would evaluate to:
SELECT CAST('2015-07-05' AS DATETIME) between '2015-07' and '2015-07';
-- This query returns 0 == False.
Using CREATE TABLE ... AS SELECT syntax, first de-nest the sub-queries. Note: as I don't know the data, I'm not sure which indexes need to be unique. You can delete the tables once the result is consumed.
Table 1:
CREATE TABLE sub_a (KEY(CRR_DT), KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT CRR_DT,
TOU,
SRCE,
SINK,
NAME,
SEASON,
SRCESUMCONG,
SINKSUMCONG,
SRCEAVGCONG,
SINKAVGCONG,
SUMSINKMSRCE,
AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01';
Table 2:
CREATE TABLE sub_b (KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT MARKET,
TERM,
TOU,
SRCE,
SINK,
NAME,
SCID,
CTYPE,
START_DT,
END_DT,
SUM(MW) AS MW_AWARD,
SUBSTR(MARKET,1,3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL','LDES')
GROUP BY MARKET, TERM, TOU,
SRCE, SINK, NAME, SCID,
CTYPE, START_DT, END_DT, MARK
-- note the two added columns in the groupby clause.
After this, the final query would be simply:
SELECT a.CRR_DT,
a.TOU,
a.SRCE,
a.SINK,
a.NAME,
a.SEASON,
a.SRCESUMCONG,
a.SINKSUMCONG,
a.SRCEAVGCONG,
a.SINKAVGCONG,
a.SUMSINKMSRCE,
a.AVGSINKMSRCE,
a.HOURCOUNT,
b.TERM,
b.CTYPE,
b.START_DT,
b.END_DT,
b.MW_AWARD,
b.SCID,
b.Mark
FROM sub_a a
JOIN sub_b b ON a.NAME = b.NAME AND a.TOU = b.TOU
WHERE a.CRR_DT BETWEEN DATE_FORMAT(b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(b.END_DT,'%Y-%m-01')
ORDER BY NAME,
CRR_DT,
TOU;
The above where clause follows the same logic used in your query, except it's not trying to cast to a string. However, this WHERE clause may be more appropriate:
WHERE sub_a.CRR_DT BETWEEN DATE_FORMAT(sub_b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(DATE_ADD(sub_b.END_DT, INTERVAL 1 MONTH),'%Y-%m-01')
Finally, both sub_a and sub_b seem to have the fields SRCE and SINK. Would the result change if you added them to the join? That could further optimize the query's (at this point, it's fair to say queries') processing time.
By doing the above, we hopefully avoid two full table scans, but I don't have your data set, so I'm only making an educated guess here.
If it's possible to express this logic without using intermediary tables, and directly via joins to the actual underlying tables CRR_CONGCALC and CRR_INVENTORY, that would be even faster.
Full table scans operations are not always bad, or necessarily evil. Sometimes, a full scan is the most efficient way to satisfy a query. For example, the query SELECT * FROM mytable requires MySQL to return every row in the table and every column in each row. And in this case, using an index would just make more work. It's faster just to do a full scan.
On the other hand, if you're retrieving a couple of rows out of a million, an access plan using a suitable index is very likely to be much faster than a full table scan. Effective use of a index can eliminate vast swaths of rows that would otherwise need to be checked; the index basically tells MySQL that the rows we're looking for cannot be in 99% of the blocks in the table, so those blocks don't need to be checked.
MySQL processes views (including inline views) differently than other databases. MySQL uses the term derived table for an inline view. In your query a, b and c are all derived tables. MySQL runs the query to return the rows, and then materializes the view into a table. Once that is completed, the outer query can run against the derived table. But as of MySQL 5.5 (and I think 5.6), inline views are always materialized as derived tables. And that's a performance killer for large sets. (Some performance improvements are coming in newer versions of MySQL, some automatic indexing.)
Also, predicates in the outer query do not get pushed down into the view query. That is, if we run a query like this:
SELECT t.foo
FROM mytable t
WHERE t.foo = 'bar'
MySQL can make use of an index with a leading column of foo to efficiently locate the rows, even if mytable contains millions of rows. But if we write the query like this:
SELECT t.foo
FROM (SELECT * FROM mytable) t
WHERE t.foo = 'bar'
We're essentially forcing MySQL to make a copy of mytable, running the inline view query, to populate a derived table, containing all rows from mytable. And once that operation is complete, the outer query can run. But now, there's no index on the foo column in the derived table. So we're forcing MySQL to do a full scan of the derived table, to look at every row.
If we need an inline view, then relocating the predicate to the inline view query will result in a much smaller derived table.
SELECT t.foo
FROM (SELECT * FROM mytable WHERE foo = 'bar') t
With that, MySQL can make use of the index on foo to quickly locate the rows, and only those rows are materialized into the derived table. The full scan of the derived table isn't as painful now, because the outer query needs to return every row. In this example, it would also be much better to replace that * (representing every column) with just the columns we need to return.
The resultset you specify could be returned without the unnecessary inline views. A query something like this:
SELECT c.crr_dt
, c.tou
, c.srce
, c.sink
, c.name
, c.season
, c.srcesumcong
, c.sinksumcong
, c.srceavgcong
, c.sinkavgcong
, c.sumsinkmsrce
, c.avgsinkmsrce
, c.hourcount
, b.term
, b.start_dt
, b.end_dt
, b.ctype
, b.mw AS mw_award
, b.scid
, b.mark
FROM CRR_CONGCALC c
JOIN ( SELECT i.market
, i.term
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
, SUM(i.mw) AS mw
, SUBSTR(i.market, 1, 3) AS mark
FROM CRR_INVENTORY i
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
GROUP
BY i.market
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
) b
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
ORDER
BY c.name
, c.crr_dt
, c.tou
NOTES: If start_dt and end_dt are defined as DATE, DATETIME or TIMESTAMP columns, then I'd prefer to write the predicate like this:
AND c.crr_dt BETWEEN DATE_FORMAT(b.start_dt,'%Y-%m') AND DATE_FORMAT(b.end_dt,'%Y-%m')
(I don't think there's any performance to be gained there; that just makes it more clear what we're doing.)
In terms of improving performance of that query...
If we're returning a small subset of rows from CRR_INVENTORY, based on the predicates:
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
Then MySQL would likely be able to make effective use of an index with leading columns of (coption,scid,start_dt). That's assuming that this represents a relatively small subset of rows from the table. If those predicates are not very selective, if we're really getting 50% or 90% of the rows in the table, the index is likely going to be much less effective.
We might be able to get MySQL to make use of an index to satisfy the GROUP BY clause, without requiring a sort operation. To get that, we'd need an index with leading columns that match the columns listed in the GROUP BY clause.
The derived table isn't going to have an index on it, so for the best performance of the join operation, once the derived table is materialized we are going to want a suitable index on the other table, CRR_CONGCALC. We want the leading columns of that index to be usable for the lookup of the matching rows, given these predicates:
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
So, we want an index with leading columns of (name, tou, crr_dt) to be able to efficiently locate the matching rows.
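Again assuming the CREATE TABLE statements shown in the question, the two indexes this answer calls for could be created like this (the index names are illustrative):
ALTER TABLE CRR_INVENTORY ADD INDEX ix_coption_scid_start (COPTION, SCID, START_DT);
ALTER TABLE CRR_CONGCALC  ADD INDEX ix_name_tou_crrdt (NAME, TOU, CRR_DT);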

Optimize query for table with hundreds of millions of rows

this feels like a "do my homework for me" kind of question but I'm really stuck here trying to make this query run quickly against a table with many many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will show all the required columns but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) not null,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with a 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using select distinct instead, which doesn't cause the 'Using filesort'. This would probably be okay if there was a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, with missing target1Name and target2Name columns) runs in about 5 seconds on ~24million records. Our target is in the hundreds of millions and we'd like the query to continue to perform along those lines (hoping to keep it under 1-2 minutes, but we'd like to have it much better), but my fear is once we hit that larger amount of data it won't (work to simulate additional rows is underway).
I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.
As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as it always does) breaks efficient usage of the Transactions index (that is, the index is actually used only for the clientId equality and the first part of the range condition, and it is not used for the grouping).
Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods and replace the BETWEEN clause with a period IN (1,2,3,4,5)). But this might become an overhead for your table.
Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will a) move the range condition to the end, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query does not need extra disk IO operations).
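A sketch of the index this answer describes, against the AuditEvent definition above (the index name is arbitrary). As a side note, the COUNT(NULLIF(...)) works because NULLIF returns NULL when the range check evaluates to 0 (false), and COUNT ignores NULLs:
ALTER TABLE AuditEvent
  ADD INDEX client_target_type_time (clientId, target1Id, type, eventTime);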
UPD1:
I am sorry, yesterday I did not carefully read your post and did not notice that your problem is to retrieve target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute a subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not have Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then just try the following:
SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;

Missing MySQL optimization

I have precomputed some similarities (about 70 million) and want to find the similarities from one track to all other tracks. I only need the top 100 tracks with the highest similarities. For my calculations I run this query about 15,000 times with different tracks as input. After a reboot of the machine, one run of all 15k queries needs over 600 seconds. After several runs, MySQL has - I think - cached the indices, so the complete run needs about 15 seconds. My only worry is: I have a very high Handler_read_rnd_next value.
I have a MySQL table with this structure:
CREATE TABLE `similarity` (
`similarityID` int(11) NOT NULL AUTO_INCREMENT,
`trackID1` int(11) NOT NULL,
`trackID2` int(11) NOT NULL,
`tracksim` double DEFAULT NULL,
`timesim` double DEFAULT NULL,
`tagsim` double DEFAULT NULL,
`simsum` double DEFAULT NULL,
PRIMARY KEY (`similarityID`),
UNIQUE KEY `trackID1` (`trackID1`,`trackID2`),
KEY `trackID1sum` (`trackID1`,`simsum`),
KEY `trackID2sum` (`trackID2`,`simsum`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I want to run a great many queries against this table. The queries look like this:
// simsum is a sum over tracksim, timesim, tagsim
(
SELECT similarityID, trackID2, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID1 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
UNION
(
SELECT similarityID, trackID1, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID2 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
ORDER BY simsum DESC
LIMIT 0,100
The query is quite fast, under 0.1 sec (see the previous question), but I'm worried about the very large numbers on the status page. I thought I had set every index that I'm using in the query.
Handler_read_rnd 88,0 M
Handler_read_rnd_next 20,0 G
Is there anything "wrong"? Could I get the query even faster? Do I have to worry about the 20G?
Thanks in advance
The first thing which is obviously wrong here is that you seem to be calculating a directional relationship between tuples - if f(a,b)===f(b,a) then you could simplify your system a lot by swapping around track1 and track2 where track1 is greater than track2 but retaining the existing primary key (and ignore collisions).
You're only halving the amount of data - so it won't be a huge performance increase.
There may be further scope for improving performance, but this is very much dependent on how frequently the data changes; more specifically, you could prune the records where the similarity is not in the top 100.
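A sketch of the suggested normalization, assuming f(a,b) = f(b,a). UPDATE IGNORE skips rows whose swap would collide with the existing unique key (the "ignore collisions" above), and the user variable keeps the swap correct because MySQL evaluates SET assignments left to right:
UPDATE IGNORE similarity
SET trackID1 = (@tmp := trackID1),   -- remember the old trackID1
    trackID1 = trackID2,
    trackID2 = @tmp
WHERE trackID1 > trackID2;
-- Rows skipped because of a collision keep their original order
-- and can be cleaned up (deleted or merged) separately.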