Create index to optimize slow query - mysql

There is a query that takes too long on a 250,000-row table. I need to speed it up:
create table occurrence (
occurrence_id int(11) primary key auto_increment,
client_id varchar(16) not null,
occurrence_cod varchar(50) not null,
entry_date datetime not null,
zone varchar(8) null default null
)
;
insert into occurrence (client_id, occurrence_cod, entry_date, zone)
values
('1116', 'E401', '2011-03-28 18:44', '004'),
('1116', 'R401', '2011-03-28 17:44', '004'),
('1116', 'E401', '2011-03-28 16:44', '004'),
('1338', 'R401', '2011-03-28 14:32', '001')
;
select client_id, occurrence_cod, entry_date, zone
from occurrence o
where
occurrence_cod = 'E401'
and
entry_date = (
select max(entry_date)
from occurrence
where client_id = o.client_id
)
;
+-----------+----------------+---------------------+------+
| client_id | occurrence_cod | entry_date | zone |
+-----------+----------------+---------------------+------+
| 1116 | E401 | 2011-03-28 16:44:00 | 004 |
+-----------+----------------+---------------------+------+
1 row in set (0.00 sec)
The table structure is from a commercial application and cannot be altered.
What would be the best index(es) to optimize it? Or a better query?
EDIT:
I want the last occurrence of the E401 code for each client, and only if that client's last occurrence is that code.

The ideal indexes for such a query would be:
index #1: [client_id] + [entry_date]
index #2: [occurrence_cod] + [entry_date]
Nevertheless, those indexes can be simplified if your data happens to have certain characteristics. This will save file space, and also time when data are modified (insert/delete/update).
If there is rarely more than one "occurrence" record for each [client_id], then index #1 can be just [client_id].
In the same way, if there is rarely more than one "occurrence" record for each [occurrence_cod], then index #2 can be just [occurrence_cod].
It may be more useful to turn index #2 into [entry_date] + [occurrence_cod]. This will enable you to use the index for criteria that are only on [entry_date].
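For reference, a minimal sketch of those two indexes in MySQL syntax, assuming adding indexes is permitted even though the table structure itself cannot change (the index names are illustrative):
alter table occurrence add index idx_client_date (client_id, entry_date);
alter table occurrence add index idx_cod_date (occurrence_cod, entry_date);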

Unless you are truly trying to get the row with the max date per client, and only when that latest row's occurrence_cod matches, this simpler query should work:
select client_id, occurrence_cod, entry_date, zone
from occurrence o
where occurrence_cod = 'E401'
ORDER BY entry_date DESC
LIMIT 1;
It will return the most recent row with occurrence_cod='E401'

select
a.client_id,
a.occurrence_cod,
a.entry_date,
a.zone
from occurrence a
inner join (
select client_id, occurrence_cod, max(entry_date) as entry_date
from occurrence
group by client_id, occurrence_cod
) as b
on
a.client_id = b.client_id and
a.occurrence_cod = b.occurrence_cod and
a.entry_date = b.entry_date
where
a.occurrence_cod = 'E401'
Using this approach you avoid running the subselect once per row; it should be faster to compare two big sets of data than to evaluate a subquery for each row of the set.
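A hedged addition: a composite index covering the grouping and join columns should let MySQL build the derived table from the index alone (the index name is illustrative):
alter table occurrence add index idx_client_cod_date (client_id, occurrence_cod, entry_date);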

I'd re-write the query:
select client_id, occurrence_cod, max(entry_date), zone
from occurrence
group by client_id, occurrence_cod, zone;
(assuming the other lines are indeed identical, and entry date is the only thing that changes).

Did you try putting an index on occurrence_cod?

Try this if the other approaches are not available.
Create a new table: last_occurrence.
Every time a new occurrence is recorded for a client, update the corresponding row in this last_occurrence table.
By doing this, you just need the following SQL to get your result :)
select * from last_occurrence
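A minimal sketch of that approach, assuming one row per client kept current by an application-side upsert (the table layout mirrors the original occurrence table but is otherwise hypothetical):
create table last_occurrence (
client_id varchar(16) primary key,
occurrence_cod varchar(50) not null,
entry_date datetime not null,
zone varchar(8) null default null
);
-- run on every new occurrence, e.g.:
insert into last_occurrence (client_id, occurrence_cod, entry_date, zone)
values ('1116', 'E401', '2011-03-28 18:44', '004')
on duplicate key update
occurrence_cod = values(occurrence_cod),
entry_date = values(entry_date),
zone = values(zone);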

Related

Best practice in MySQL for selecting two interchangeable columns and counting them, returning the most recent result

I have a MySQL table that looks like:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`from` varchar(12) NOT NULL,
`to` varchar(12) NOT NULL,
`message` text,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=latin1;
So each time a message is sent or received, it is stored as:
# id from, to, message, timestamp
'65', '+1231303****', '+1833935****', 'Showtimes', '2022-01-26 09:26:10'
'64', '+1833935****', '+1231303****', 'Showtimes are: 12:30 someresponse', '2022-01-26 09:26:10'
I want to create an index of these conversation threads, and need to be able to execute a query that selects a conversation based on it being addressed either from or to a specific number, returns the number of rows that match either, and at the same time returns the last message that was sent. So basically I want it to return:
recipient (the other phone number, not the one I'm using to look up), count(messages), lastmessage
Individually, I can query all of this separately, since most of my experience here revolves around using PHP to untangle the data I'm going after. What I'm curious about is a single query that lets MySQL handle this, rather than submitting multiple queries to the database server. I figure this may be a good time to approach it, since several projects I've coded have run out of memory processing so many queries between so many loops.
Apologies in advance if this has been answered somewhere else already. I searched extensively for an answer, but the few results I found used a completely different table structure than I am using, and the MySQL query I was able to fumble together didn't work. I stand next to my work as a PHP programmer, but my MySQL needs some work. Hence I'm here!
If a conversation thread can be defined by a unique combination of from and to, then creating a compound key where the first node is the lower of the two numbers means all the conversations in a thread can be grouped together. However, selecting on from OR to means many conversation threads may be selected. For example:
DROP TABLE IF EXISTS T;
CREATE TABLE T(ID INT AUTO_INCREMENT PRIMARY KEY, FROMNO INT, TONO INT);
INSERT INTO T(FROMNO,TONO) VALUES
(1,2),(2,1),
(1,3),(4,1),(1,2);
WITH CTE AS
(SELECT * ,
CASE WHEN FROMNO < TONO THEN CONCAT(FROMNO,TONO)
ELSE CONCAT(TONO,FROMNO)
END AS CVAL
FROM T
WHERE FROMNO = 1 OR TONO = 1
),
CTE1 AS
(SELECT *,
DENSE_RANK() OVER (ORDER BY CVAL) DR
FROM CTE
),
CTE2 AS
(SELECT CVAL,COUNT(*) conversations,MAX(ID) MAXID
FROM CTE1
GROUP BY CVAL
)
SELECT CTE2.CVAL,CTE2.conversations,CTE2.MAXID,T.ID
FROM CTE2
JOIN T ON T.ID = CTE2.MAXID;
Yields
+------+---------------+-------+----+
| CVAL | conversations | MAXID | ID |
+------+---------------+-------+----+
| 13 | 1 | 3 | 3 |
| 14 | 1 | 4 | 4 |
| 12 | 3 | 5 | 5 |
+------+---------------+-------+----+
3 rows in set (0.002 sec)
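Applied to the original messages table, a hedged sketch of the same idea might look like the following (MySQL 8+ for the CTEs; the lookup number and the '|' separator, added to keep concatenated keys unambiguous, are assumptions, and using MAX(id) for the latest message assumes ids increase with time):
WITH threads AS (
SELECT id,
CASE WHEN `from` < `to` THEN CONCAT(`from`, '|', `to`)
ELSE CONCAT(`to`, '|', `from`)
END AS thread_key
FROM messages
WHERE `from` = '+1231303****' OR `to` = '+1231303****'
),
agg AS (
SELECT thread_key, COUNT(*) AS message_count, MAX(id) AS last_id
FROM threads
GROUP BY thread_key
)
SELECT agg.thread_key, agg.message_count, m.message AS lastmessage
FROM agg
JOIN messages m ON m.id = agg.last_id;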

Compare time between consequent rows using MySQL 5.5

This is my table tusers on a MySQL 5.5.1 community edition database:
mysql> SELECT * FROM `tusers`;
+------------+------------+----------+-----+
| tIDUSER | tDate | tHour | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:54:42 | 1 |
| Controneri | 2022-01-06 | 07:43:38 | 2 |
| Controneri | 2022-01-06 | 07:13:09 | 3 |
| Controneri | 2022-01-06 | 06:31:52 | 4 |
| Controneri | 2022-01-06 | 06:13:12 | 5 |
+------------+------------+----------+-----+
5 rows in set (0.13 sec)
I need to select from the table tusers only these rows:
+------------+------------+----------+-----+
| tIDUSER | tDate | tHour | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:43:38 | 2 |
| Controneri | 2022-01-06 | 06:13:12 | 5 |
+------------+------------+----------+-----+
The other rows are excluded because they are repeat accesses by the same user Controneri within one hour of the previous retained row.
Each access by a user to the web page is stored in the table tusers with its date and time.
But I have to extract only the first access and exclude the repeated accesses within the time span of one hour.
In this example the user Controneri logged in 5 times on January 6, but the valid accesses are those at 06:13:12 and 07:43:38, because after the access at 06:13:12 there were other accesses before 07:13:12, i.e. within one hour of 06:13:12 (06:31:52 and 07:13:09, rows 4 and 3).
I have tried without success.
My table structure and the SELECT query are below and on db-fiddle.com, which offers MySQL 5.
Any suggestion?
-- ----------------------------
-- Table structure for tusers
-- ----------------------------
DROP TABLE IF EXISTS `tusers`;
CREATE TABLE `tusers` (
`tIDUSER` varchar(255) NULL DEFAULT NULL,
`tDate` date NULL DEFAULT NULL,
`tHour` time NULL DEFAULT NULL,
`tID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`tID`) USING BTREE
) ENGINE = InnoDB;
-- ----------------------------
-- Records of tusers
-- ----------------------------
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:54:42', 1);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:43:38', 2);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:13:09', 3);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:31:52', 4);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:13:12', 5);
SELECT
a.tID,
a.tDate,
a.tHour,
a.tIDUSER,
TIMEDIFF( a.tHour, b.tHour ) AS tDif
FROM
`tusers` a
JOIN `tusers` b ON
a.tDate = b.tDate
AND a.tIDUSER = b.tIDUSER
AND a.tID > b.tID
WHERE
( TIMEDIFF( a.tHour, b.tHour ) BETWEEN '00:00:00' AND '01:00:00' )
ORDER BY
a.tIDUSER,
a.tDate,
a.tHour ASC;
For MySQL 5.5 you can achieve this by tracking the previous values in user variables -
SELECT tIDUSER, tDate, tHour, tID
FROM (
SELECT
tusers.*,
IF(@prev_date_time IS NULL OR @prev_user <> tIDUSER OR @prev_date_time + INTERVAL 1 HOUR < TIMESTAMP(tDate, tHour), @prev_date_time := TIMESTAMP(tDate, tHour), NULL) AS `show`,
@prev_user := tIDUSER
FROM tusers, (SELECT @prev_date_time := NULL, @prev_user := NULL) n
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC
) t
WHERE `show` IS NOT NULL
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC;
And here's a db<>fiddle. Thanks to sticky bit as I took the liberty of "borrowing" from their db<>fiddle.
The MySQL 5.6 manual states -
However, the order of evaluation for expressions involving user
variables is undefined.
And in later versions this is extended to say -
The order of evaluation for expressions involving user variables is
undefined. For example, there is no guarantee that SELECT @a, @a:=@a+1
evaluates @a first and then performs the assignment.
The MySQL 5.7 manual also states -
It is also possible to assign a value to a user variable in statements
other than SET. (This functionality is deprecated in MySQL 8.0 and
subject to removal in a subsequent release.) When making an assignment
in this way, the assignment operator must be := and not = because the
latter is treated as the comparison operator = in statements other
than SET:
Despite the above warnings, this method has been widely used for many years. Your mileage may vary.
I suspect this will perform badly with larger result sets but give it a try.
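If it does, an index matching the ORDER BY inside the derived table may help; a minimal sketch (the index name is illustrative):
ALTER TABLE tusers ADD INDEX idx_user_date_hour (tIDUSER, tDate, tHour);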
As requested by the OP in the comments, here is a query using recursive CTEs, which are available with MySQL version 8 and higher.
WITH RECURSIVE
cte1
AS
(
SELECT tusers.tiduser,
tusers.tdate,
tusers.thour,
tusers.tid,
addtime(tusers.tdate, tusers.thour) AS sane_timestamp_representation,
row_number() OVER (PARTITION BY tusers.tiduser
ORDER BY addtime(tusers.tdate, tusers.thour) ASC) AS rn
FROM tusers
),
cte2
AS
(
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
0 AS n
FROM cte1
WHERE cte1.rn = 1
UNION ALL
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
cte2.n + 1 AS n
FROM cte2
INNER JOIN cte1
ON cte2.tiduser = cte1.tiduser
AND cte1.sane_timestamp_representation > adddate(cte2.sane_timestamp_representation, INTERVAL 1 HOUR)
),
cte3
AS
(
SELECT cte2.tiduser,
cte2.tdate,
cte2.thour,
cte2.tid,
cte2.sane_timestamp_representation,
row_number() OVER (PARTITION BY cte2.tiduser,
cte2.n
ORDER BY cte2.sane_timestamp_representation ASC) rn
FROM cte2
)
SELECT cte3.tiduser,
cte3.tdate,
cte3.thour,
cte3.tid
FROM cte3
WHERE cte3.rn = 1
ORDER BY cte3.tiduser ASC,
cte3.sane_timestamp_representation ASC;
db<>fiddle
1. In cte1 we first and foremost unite the day and hour parts of the timestamp (not the brightest idea to save them as two different columns; it'll become a mess when day boundaries have to be crossed). We also assign a row_number() rn according to the timestamp in ascending order per user. cte1 acts as our "base table" from now on.
2. Now in cte2 the recursion happens. As anchor we take all the rows from cte1 where cte1.rn = 1. These are the records with the minimum timestamp for each user. We also add a number n; for those initial anchor rows we set n to 0. n will act as an indicator of which rows cannot cover each other: rows with n + x for x > 1 cannot be covered by any row with n (per user).
In the recursive step we join, per user, all records from cte1 that lie more than an hour past the current ones. Since these cannot be covered by the records already in the result set (per user), as they are more than an hour later, we assign them n + 1 as n.
3. cte3 adds another row_number() rn, ordering the records by timestamp ascending per user and n. Those with an rn of 1 are not themselves covered by any previous record for the user, because all others with equal or greater n have greater timestamps, and those with lesser n don't cover them by the way we constructed n. So we can select the records from cte3 where rn = 1 and get our final result.
One big fat warning though:
The intermediate result sets will grow rapidly! You can try to select from cte3 without a WHERE clause and see for yourself. So while this shows it can be done theoretically, it might not be practical, even for medium sets of data. The needed resources can quickly exceed maximums.
(And well, since AFAIK SQL with recursive CTEs is Turing complete and the problem seems well computable, it was clear that it can be done anyway. But it still was interesting to see an example how it can be done, I think.)
Maybe it can be optimized. The key, I believe, is to limit the joined rows in the recursive step. We actually only need to join the oldest record past an hour, that would be the next record of interest. That would also make cte3 and the WHERE in the final SELECT unnecessary (unless for projection to get rid of the helper columns). But I didn't find a way to do so. LIMIT as well as window functions aren't allowed or implemented for recursive CTEs, at least in the recursive step. But if somebody comes up with such an optimization, I'd love to see it!
Oh, and the stupid timestamp representation in two columns, which needs to be put together at first, will also render the use of indexes on the timestamps impossible. So that's another factor limiting performance here.

Mysql group by aggregation sort and limit [duplicate]

This question already has answers here:
Get records with max value for each group of grouped SQL results
(19 answers)
Closed 2 years ago.
I am trying to figure out a seemingly trivial SQL query.
For all users in the table I want to find the time and data for the row with the highest time (latest event).
The following almost solves it
SELECT user, MAX(time) as time FROM tasks GROUP BY user;
The problem is of course that the data column cannot be reduced. I think therefore I should use a WHERE or ORDER BY + LIMIT construction. But I am too far out of my domain here to know how this should be done properly. Any hints?
Note. It is not possible to use GROUP BY in this instance because I want to select on the table row ID, which cannot be aggregated, obviously.
-- MYSQL
DROP DATABASE IF EXISTS test;
CREATE DATABASE test;
USE test;
CREATE TABLE tasks (
id int AUTO_INCREMENT,
user varchar(100) NOT NULL,
time date NOT NULL,
data varchar(100) NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO tasks (user, time, data) VALUES
("Kalle", "1970-01-01", "old news"),
("Kalle", "2020-01-01", "latest shit"),
("Pelle", "1970-01-01", "regular data");
-- Expected output
-- +----+-------+------------+--------------+
-- | id | user | time | data |
-- +----+-------+------------+--------------+
-- | 2 | Kalle | 2020-01-01 | latest shit |
-- | 3 | Pelle | 1970-01-01 | regular data |
-- +----+-------+------------+--------------+
-- 2 rows in set (0.00 sec)
You can filter with a subquery:
select t.*
from tasks t
where time = (select max(t1.time) from tasks t1 where t1.user = t.user)
This query would take advantage of a multi-column index on (user, time).
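A minimal sketch of that index, assuming the tasks table from the question (the index name is illustrative):
create index idx_user_time on tasks (user, time);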
In MySQL 8.0, you can also solve this top-1-per-group with window functions:
select *
from (select t.*, row_number() over(partition by user order by time desc) rn from tasks t) t
where rn = 1

MySQL shows "possible_keys" but does not use it

I have a table with more than a million entries and around 42 columns. I am trying to run SELECT query on this table which takes a minute to execute. In order to reduce the query execution time I added an index on the table, but the index is not being used.
The table structure is as follows. Though the table has 42 columns I am only showing here those that are relevant to my query
CREATE TABLE `tas_usage` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`userid` varchar(255) DEFAULT NULL,
`companyid` varchar(255) DEFAULT NULL,
`SERVICE` varchar(2000) DEFAULT NULL,
`runstatus` varchar(255) DEFAULT NULL,
`STATUS` varchar(2000) DEFAULT NULL,
`servertime` datetime DEFAULT NULL,
`machineId` varchar(2000) DEFAULT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB AUTO_INCREMENT=2992891 DEFAULT CHARSET=latin1
The index that I have added is as follows
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (SERVERTIME,COMPANYID(20),MACHINEID(20),SERVICE(50),RUNSTATUS(10));
My SELECT Query
EXPLAIN SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
EXPLAIN result is as follows
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | last_quarter | NULL | NULL | NULL | 1765296 | 15.68 | Using where; Using temporary |
| 1 | SIMPLE | INVL | NULL | ref | invalid_company_index | invalid_company_index | 502 | servicerunprod.t1.companyid | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
| 1 | SIMPLE | INVL_MAC_ID | NULL | eq_ref | machineId | machineId | 502 | servicerunprod.t1.machineId | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
Explanation of my Query
I want to select all the records from table TAS_USAGE
which are in the date range (inclusive) from 1st October 2018 to 31st Dec 2018, AND
which do not have COMPANYID and MACHINEID values matching in tables TAS_INVALID_COMPANY and TAS_INVALID_MACHINE, AND
which do not contain the values ('credentialtest%', 'webupdate%') in the SERVICE column or the values ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '') in the RUNSTATUS column
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00'
AND t1.SERVERTIME <= '2018-12-31 00:00:00'
is strange. It covers 3 months minus 1 day plus 1 second. Suggest you rephrase thus:
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
There are multiple possible reasons why the INDEX(servertime, ...) was not used and/or was not "useful" even if used:
If more than perhaps 20% of the table involved that daterange, using the index is likely to be less efficient than simply scanning the table. Using the index would involve bouncing between the index's BTree and the data's BTree.
Starting an index with a 'range' means that the rest of the index will not be used.
Index "prefixing" (foo(10)) is next to useless.
What you can do:
Normalize most of those string columns. How many "machines" do you have? Probably nowhere near 3 million. Replacing repeated strings with a small id (perhaps a 2-byte SMALLINT UNSIGNED with a max of 65K) will save a lot of space in this table. This, in turn, will speed up the query and eliminate the desire for index prefixing; see the sketch after this list.
If normalizing is not practical because there really are upwards of 3 million distinct values, then see if shortening the VARCHAR helps. If you get it under 255, prefixing is no longer needed.
NOT IN is not optimizable. If you can invert the test and make it IN(...), more possibilities open up, such as INDEX(service, runstatus, servertime). If you have a new enough version of MySQL, I think the optimizer will hop around in the index on the two IN columns and use the index for the time range.
NOT IN ('credentialtest%', 'webupdate%') -- Is % part of the string? If you are using % as a wildcard, that construct will not work. You would need two LIKE clauses.
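As an illustration of the normalization point above, a hedged sketch with a hypothetical lookup table (all names and sizes are assumptions):
CREATE TABLE machine (
machine_id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
machine_name VARCHAR(255) NOT NULL,
UNIQUE KEY (machine_name)
);
-- TAS_USAGE would then store a machine_id SMALLINT UNSIGNED column instead of
-- machineId VARCHAR(2000), joining back to machine only when the full string is needed.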
Reformulate the query thus:
SELECT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
AND t1.SERVICE NOT IN ('credentialtest%', 'webupdate%')
AND t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed',
'Failed Success', 'Success Failed', '')
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_COMPANY WHERE companyId = t1.COMPANYID )
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_MACHINE WHERE MACHINEID = t1.MACHINEID );
If the trio t1.COMPANYID, t1.USERID, t1.MACHINEID is unique, then get rid of DISTINCT.
Since there are only 6 (of 42) columns being used in this query, building a "covering" index will probably help:
INDEX(SERVERTIME, SERVICE, RUNSTATUS, COMPANYID, USERID, MACHINEID)
This is because the query can be performed entirely within the index. In this case, I deliberately put the range first.
Focussing on the date range, MySQL basically has two options:
read the complete table consecutively and throw away records that do not fit the date range
use the index to identify the records in the date range and then look up each record in the table (using the primary key) individually ("random access")
Consecutive reads are significantly faster than random access, but you need to read more data. There will be some break-even point at which using an index becomes slower than simply reading everything, and MySQL assumes this is the case here. Whether that's the right choice will largely depend on how correctly it guessed how many records are actually in the range. If you make the range smaller, it should actually use the index at some point.
If you know that (or want to test if) using the index is faster, you can force MySQL to use it with
... FROM TAS_USAGE t1 force index (last_quarter) LEFT JOIN ...
You should test it with different ranges, and if you generate your query dynamically, only force the index when you are decently certain (as MySQL will not correct you if you e.g. specify a range that would include all rows).
There is one important way around the slow random access to the table, although it unfortunately does not work with your prefixed index, but I mention it in case you can reduce your field sizes (or change them to lookups/enums). You can include every column that MySQL needs to evaluate the query by using a covering index:
An index that includes all the columns retrieved by a query. Instead of using the index values as pointers to find the full table rows, the query returns values from the index structure, saving disk I/O.
As mentioned, since part of the data is missing in a prefixed index, those columns unfortunately cannot be used for covering.
Actually, they also cannot be used for much at all, especially not to filter records before doing the random access, as evaluating your where-condition on RUNSTATUS or SERVICE requires the complete value anyway. So you could check if e.g. RUNSTATUS is very significant - maybe 99% of your records are in status 'Failed' - and in that case add an unprefixed index on just (SERVERTIME, RUNSTATUS) (and MySQL might even pick that index on its own).
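A sketch of that narrower index (the name is illustrative; neither column needs a prefix at these lengths):
ALTER TABLE TAS_USAGE ADD INDEX time_status (SERVERTIME, RUNSTATUS);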
The distinct clause is the one that interferes with the index usage. Since the index cannot be used to help with the distinct, mysql decided against using the index completely.
If you rearrange the order of fields in the select list, in the index, and in the where clause, mysql may decide to use it:
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (COMPANYID(20),MACHINEID(20), SERVERTIME, SERVICE(50),RUNSTATUS(10));
SELECT DISTINCT t1.COMPANYID, t1.MACHINEID, t1.USERID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
This way the COMPANYID and MACHINEID fields become the leftmost fields in the distinct, where, and index - although the prefix may still cause the index to be discarded. You may want to consider reducing your varchar(255) fields.
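For example, a hedged sketch of shrinking those columns so they can be indexed without prefixes (the new lengths are pure assumptions; verify that no existing data exceeds them first):
ALTER TABLE TAS_USAGE
MODIFY companyid VARCHAR(64) DEFAULT NULL,
MODIFY machineId VARCHAR(64) DEFAULT NULL;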

Two SQL Queries - Performance difference?

I'm using MySQL with PDO in PHP and I have a SQL query, which works as expected. However, I care about performance and would like to know if I could improve my query. I'm also asking, because I want to gain some more background knowledge of SQL.
Let's say I have two tables that have a few equal fields (and some additional information, which are different in each table):
table `blog_comments`: id, userid (int) | timestamp (int) | content (varchar) | other
table `projects_comments`: id, userid (int) | timestamp (int) | content (varchar) | other
The field id is the primary key, userid + timestamp have an index in both tables, and timestamp is simply the unixtime with the length of 10 (integer).
As a simple spam protection, I block a user from submitting a new comment (no matter if blog, project or anything else) until 60 seconds have passed since his last comment. To achieve this, I get the latest timestamp of that user from all the comments tables.
This is my working query:
SELECT MAX(`last_timestamp`) AS `last_timestamp`
FROM
(
SELECT `userid`, max(`timestamp`) AS `last_timestamp`
FROM `blog_comments`
GROUP BY `userid`
UNION ALL
SELECT `userid`, max(`timestamp`) as `last_timestamp`
FROM `projects_comments`
GROUP BY `userid`
) AS `subquery`
WHERE `userid` = 1
LIMIT 0, 1;
As you can notice, I use GROUP BY inside the subqueries, and in the main query I simply filter the userid (in this case: 1). The advantage: I just need to pass the userid once as a parameter.
Now, I am interested in how SQL exactly works here. I think it will be like this: SQL first performs the subqueries, groups all the existing rows by userid and returns the whole set to the main query, which then applies the where clause to find the required userid. This seems like a big waste of performance to me.
So I thought on slightly changing the query:
SELECT max(`last_timestamp`) AS `last_timestamp`
FROM
(
SELECT max(`timestamp`) AS `last_timestamp`
FROM `blog_comments`
WHERE `userid` = 1
UNION ALL
SELECT max(`timestamp`) as `last_timestamp`
FROM `projects_comments`
WHERE `userid` = 1
) AS `subquery`
LIMIT 0, 1
Now I have to pass the userid twice, and still the whole set of rows will be looked up for the given userid. I am not sure if this really improves the performance.
I don't have a large amount of data yet to really test it; maybe I will do some test scenarios later. I would be really interested to know whether there would be a difference when there are many rows in those tables.
Would appreciate any ideas, information and tips, thanks in advance.
Edit:
MySQL explain of the first query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 4 Using where
2 DERIVED blog_comments range NULL userid 8 NULL 10 Using index for group-by
3 UNION projects_comments index NULL userid 12 NULL 6 Using index
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL
MySQL explain of the second query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2
2 DERIVED NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
3 UNION NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL
As an alternative approach...
SELECT 'It''s been more than 1 minute since your last post' As result
WHERE NOT EXISTS (
SELECT *
FROM blog_comments
WHERE userid = 1
AND timestamp > Unix_Timestamp(Date_Sub(Current_Timestamp, INTERVAL 1 MINUTE))
)
AND NOT EXISTS (
SELECT *
FROM projects_comments
WHERE userid = 1
AND timestamp > Unix_Timestamp(Date_Sub(Current_Timestamp, INTERVAL 1 MINUTE))
)
There will be a result if userid = 1 hasn't got a timestamped record within the last minute in either table.
You can also swap the logic around...
SELECT 'You''re not allowed to post just yet...' As result
WHERE EXISTS (
SELECT *
FROM blog_comments
WHERE userid = 1
AND timestamp > Unix_Timestamp(Date_Sub(Current_Timestamp, INTERVAL 1 MINUTE))
)
OR EXISTS (
SELECT *
FROM projects_comments
WHERE userid = 1
AND timestamp > Unix_Timestamp(Date_Sub(Current_Timestamp, INTERVAL 1 MINUTE))
)
This second option will probably be more efficient (EXISTS vs NOT EXISTS) but that's for you to test and prove ;)
The answer to your question is that the second should perform better in MySQL than the first, for exactly the reason you gave. MySQL will run the full group by on all the data and then select the one group.
You can see the difference in execution paths by putting an explain in front of the query. That will give you some idea of what the query is really doing.
If you have an index on user_id, timestamp, then the second query will run quite fast, only using the index. Even without an index, the second query would do a full table scan of the two tables -- and that is it. The first will do a full table scan and a file sort for the aggregation. The first takes longer.
If you wanted to pass in the userid only once, you could do something like:
select coalesce(greatest(bc_last_timestamp, pc_last_timestamp),
bc_last_timestamp, pc_last_timestamp
)
from (select (SELECT max(`timestamp`) FROM `blog_comments` bc where bc.userid = const.userid
) bc_last_timestamp,
(SELECT max(`timestamp`) FROM `projects_comments` pc where pc.userid = const.userid
) pc_last_timestamp
from (select 1 as userid) const
) t;
The query looks arcane but it should optimize similarly to your second one.