Optimizing DISTINCT SQL query with OR conditions - mysql

I have the following SQL query:
SELECT DISTINCT business_key
FROM Memory
WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed'
What I'm trying to achieve is to get all unique business keys that don't have a record with concept='case' AND attrib='status' AND value='closed'. Running this query in MySQL against 500,000 records with all unique business_keys is very slow: about 11 seconds.
I placed indexes on the business_key column and on the concept, attrib, and value columns. I also tried a combined index on all three columns (concept, attrib, value), but the result is the same.
Here is a screenshot of the EXPLAIN EXTENDED command:
The interesting thing is that running the query without the DISTINCT specifier results in very fast execution.
I had also tried this:
SELECT DISTINCT m.business_key
FROM Memory m
WHERE m.business_key NOT IN
    (SELECT c.business_key
     FROM Memory c
     WHERE c.concept = 'case' AND c.attrib = 'status' AND c.value = 'closed')
with even worse results: around 25 seconds

You could add a compound (concept, attrib, value, business_key) index so the query (if MySQL decides to use this index) can find all the info in the index without having to read the whole table.
Your query is equivalent to:
SELECT DISTINCT business_key
FROM Memory
WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
and to this (which will probably yield the same execution plan):
SELECT business_key
FROM Memory
WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
GROUP BY business_key
Since the 4 columns that are to be put in the index are all VARCHAR(255), the index length will be pretty large. MyISAM will not allow more than 1000 bytes and InnoDB no more than 3072.
One solution is to cut the length of the last part, keeping the index length under 1000 bytes: 255+255+255+230 = 995:
(concept, attrib, value, business_key(230))
It will work, but it's really not good, performance-wise, to have such large index lengths.
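Expressed as DDL, that would be something like the following (the index name is a placeholder; note that the byte limits above assume a single-byte character set such as latin1, while with utf8 each character counts as 3 bytes):
ALTER TABLE Memory
ADD INDEX concept_attrib_value_bk (concept, attrib, value, business_key(230)) ;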
Another option is to lower the length of all or some of those 4 columns, if that complies with the data you expect to store there. There's no need to declare length 255 if you expect a maximum of 100 characters in a column.
Another option you may consider is putting those 4 columns in 4 separate reference tables. (Or just the columns that have repeated data. It seems that business_key will have duplicate data but not that many. So, it won't be much good to make a reference table for that column.)
Example: Put concept values in a new table with something like:
CREATE TABLE Concept_Ref
( concept_id INT AUTO_INCREMENT
, concept VARCHAR(255)
, PRIMARY KEY (concept_id)
, UNIQUE INDEX concept_idx (concept)
) ;
INSERT INTO Concept_Ref
(concept)
SELECT DISTINCT concept
FROM Memory ;
and then change the Memory table with:
ALTER TABLE Memory
ADD COLUMN concept_id INT ;
do this (once):
UPDATE Memory m
JOIN Concept_Ref c
ON c.concept = m.concept
SET m.concept_id = c.concept_id ;
and then drop the Memory.concept column:
ALTER TABLE Memory
DROP COLUMN concept ;
You can also add FOREIGN KEY references if you change your tables from MyISAM to InnoDB.
After doing the same for all 4 columns, not only will the length of the new compound index in the Memory table be much smaller, but your table size will be much smaller, too. Additionally, any other index that uses any of those columns will have a smaller length.
Of course, the query will need to be written with 4 JOINs. And any INSERT, UPDATE or DELETE statement on this table will have to be changed and carefully designed.
But overall, I think you will have better performance. With the design you have now, it seems that values like 'case', 'status' and 'closed' are repeated many times.
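For illustration, once concept, attrib and value each have a reference table (keeping business_key inline, as suggested above), the original query might look like this; Attrib_Ref and Value_Ref are assumed to follow the same pattern as Concept_Ref:
SELECT DISTINCT m.business_key
FROM Memory m
JOIN Concept_Ref c ON c.concept_id = m.concept_id
JOIN Attrib_Ref a ON a.attrib_id = m.attrib_id
JOIN Value_Ref v ON v.value_id = m.value_id
WHERE NOT (c.concept = 'case' AND a.attrib = 'status' AND v.value = 'closed') ;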

This will allow the use of an index. It will still take some time to retrieve all the rows.
SELECT DISTINCT business_key FROM Memory
WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')

If the query runs quickly without DISTINCT, have you tried:
SELECT DISTINCT business_key FROM
(SELECT business_key
 FROM Memory
 WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed') v
?

Related

How to optimize the following SELECT query

We have the following table
id # primary key
device_id_fk
auth # there's an index on it
old_auth # there's an index on it
And the following query.
$select_user = $this->db->prepare("
SELECT device_id_fk
FROM wtb_device_auths AS dv
WHERE (dv.auth= :auth OR dv.old_auth= :auth)
LIMIT 1
");
Here's the EXPLAIN. I can't reach the main client's server, but here's one from another client with less data:
Since there are a lot of other UPDATE queries on auth, the UPDATE queries start getting written to the slow query log and the CPU spikes.
If you remove the index from auth, then the SELECT query gets written to the slow query log, but not the UPDATE; if you add an index to device_id_fk, it makes no difference.
I tried rewriting the query using UNION instead of OR, but I was told that there was still a CPU spike and the SELECT query still gets written to the slow query log.
$select_user = $this->db->prepare("
    (SELECT device_id_fk
     FROM wtb_device_auths AS dv
     WHERE dv.auth = :auth)
    UNION ALL
    (SELECT device_id_fk
     FROM wtb_device_auths AS dv
     WHERE dv.old_auth = :auth)
    LIMIT 1
");
Explain
Most often, this is the only query in the slow query log. Is there a more optimal way to write the query? Is there a more optimal way to add indexes? The client is using an old MariaDB version, the equivalent of MySQL 5.5, on a CentOS 6 server running LAMP.
Additional info
The update query that gets logged to the slow query log whenever an index is added to auth is
$update_device_auth = $this->db->prepare("UPDATE wtb_device_auths SET auth= :auth WHERE device_id_fk= :device_id_fk");
Your few indexes should not be slowing down your updates.
You need two indexes to make both your update and select perform well. My best guess is you never had both at the same time.
UPDATE wtb_device_auths SET auth= :auth WHERE device_id_fk= :device_id_fk
You need an index on device_id_fk for this update to perform well. And regardless of its index it should be declared a foreign key.
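A minimal sketch of both (the wtb_devices table name is a guess, and the foreign key requires InnoDB):
ALTER TABLE wtb_device_auths ADD INDEX (device_id_fk) ;
ALTER TABLE wtb_device_auths
ADD CONSTRAINT fk_device_auths_device
FOREIGN KEY (device_id_fk) REFERENCES wtb_devices (id) ;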
SELECT device_id_fk
FROM wtb_device_auths AS dv
WHERE (dv.auth= :auth OR dv.old_auth= :auth)
LIMIT 1
You need a single combined index on auth, old_auth for this query to perform well.
Separate auth and old_auth indexes should also work well, assuming there aren't too many duplicates. MySQL will merge the results from the indexes, and that merge should be fast... unless a lot of rows match.
If you also search for old_auth alone, add an index on old_auth.
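As DDL, the combined index might look like this (the index name is a placeholder); your existing separate indexes on auth and old_auth are the index-merge alternative:
ALTER TABLE wtb_device_auths ADD INDEX auth_old_auth (auth, old_auth) ;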
And, as others have pointed out, the select query could return one of several matching devices with a matching auth or old_auth. This is probably bad. If auth and old_auth are intended to identify a device, add a unique constraint.
Alternatively, you need to restructure your data. Multiple columns holding the same value is a red flag. It can result in a proliferation of indexes, as you're experiencing, and also limit how many versions you can store. Instead, have just one auth per row and allow each device to have multiple rows.
create table wtb_device_auths (
    id serial primary key,
    device_id bigint not null,
    auth varchar(255) not null, -- a TEXT column can't be indexed in MySQL without a prefix length
    created_at datetime not null default current_timestamp,
    index(auth),
    foreign key (device_id) references wtb_devices(id)
);
Now you only need to search one column.
select device_id from wtb_device_auths where auth = ?
Now one device can have many wtb_device_auths rows. If you want the current auth for a device, search for the newest one.
select auth
from wtb_device_auths
where device_id = ?
order by created_at desc
limit 1
Since each device will only have a few auths, this is likely to be plenty fast with the device_id index alone; sorting the handful of rows for a device will be fast.
If not, you might need an additional combined index on (device_id, created_at): it covers the equality filter on device_id and lets MySQL read the newest row for that device directly off the index, with no sort at all.
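As DDL (index name assumed):
alter table wtb_device_auths add index device_created (device_id, created_at);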
OR usually leads to a slow, full-table scan. This UNION trick, together with appropriate INDEXes, is much faster:
( SELECT device_id_fk
FROM wtb_device_auths AS dv
WHERE dv.auth= :auth
LIMIT 1 )
UNION ALL
( SELECT device_id_fk
FROM wtb_device_auths AS dv
WHERE dv.old_auth= :auth
LIMIT 1 )
LIMIT 1
And have these "composite" indexes:
INDEX(auth, device_id_fk)
INDEX(old_auth, device_id_fk)
These indexes can replace the existing indexes with the same first column.
Notice that I had 3 LIMITs; you had only 1.
That UNION ALL involves a temp table. You should upgrade to 5.7 (at least); that version optimizes away the temp table.
A LIMIT without an ORDER BY gives a random row; is that OK?
Please provide the entire text of the slowlog entry for this one query -- it has info that might be useful. If "Rows_examined" is more than 2 (or maybe 3), then something strange is going on.

Issues with a MySQL Index due to a specific part

The query is
SELECT row
FROM `table`
USE INDEX(`indexName`)
WHERE row1 = '0'
AND row2 = '0'
AND row3 >= row4
AND (row5 = '0' OR row5 LIKE 'value')
For this query, I've created an index using:
CREATE INDEX indexName ON `table` (row1, row2, row3, row5);
However, the performance is not really good. It's extracting about 17,000+ rows out of a 5.9+ million row table in anywhere from 6-12 seconds.
It seems like the bottleneck is the row3 >= row4 - because without that part in the code it runs in 0.6-0.7 seconds.
(from Comment)
The row (placeholder column name) is actually the id (primary key, index) column in the table, which is the result set I'm outputting later on. I'm outputting an array of IDs that match the parameters in my query, and then selecting a random ID from that array to gather data through the final query on a specific row. This was done as a workaround for RAND(). Any adjustments needed based on that knowledge?
17K rows is not a tiny result set. Large result sets often take time just because of the overhead of delivering the data from the MySQL server to the program requesting them.
The contents of the 'value' you use in row5 LIKE 'value' matter a great deal to query performance. If 'value' starts with a wildcard character like % your query will be slow.
That being said, you need a so-called covering index. You've tried to create one with the index you created. It's close but not perfect.
Your query filters on equality to constant values on row1, row2, and row5, so those columns should come first in your index. The query planner can random-access your index to the first matching entry, and then sequentially scan the index until it gets to the last matching entry. That's as fast as it gets.
Then you want to examine row3 and row4 (to compare them). Those columns should come next in the index. Finally, if your query's SELECT clause mentions a subset of the columns in your table you should put the rest of those columns in the index. So, based on the query in your question, your index should be
CREATE INDEX indexName ON `table` (row1, row2, row5, row3, row4, row);
The query planner will be able to satisfy the entire query by scanning through a subset of the index, using a so-called index range scan. That should be decently fast.
Pro tip: don't force the query planner's hand with USE INDEX(). Instead, structure your indexes to handle your queries efficiently.
An index can't be used to compare two columns in the same table (at best, it could be used for an index scan rather than a table scan if all output fields are contained in the index), so there basically is no "correct" way to do this.
If you have control over the structure AND the processes that fill the table, you could add a calculated field that holds the difference between the two fields. Then add that field to the index and adjust your query to use that field instead of the other 2.
It ain't pretty and doesn't offer a lot of flexibility (e.g. if you want to compare another pair of fields, you need to add a field for them as well), but it does get the job done.
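On MySQL 5.7+ the calculated field can be a generated column, so the table maintains it for you. A sketch, assuming row3 and row4 are numeric and using placeholder names for the new column and index:
ALTER TABLE `table`
ADD COLUMN row3_minus_row4 INT AS (row3 - row4) STORED,
ADD INDEX idx_diff (row1, row2, row5, row3_minus_row4, row) ;
SELECT row
FROM `table`
WHERE row1 = '0'
AND row2 = '0'
AND row3_minus_row4 >= 0
AND (row5 = '0' OR row5 LIKE 'value')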
(This is an adaptation of http://mysql.rjweb.org/doc.php/random )
Let's actually fold the randomization into the query. This will eliminate gathering a bunch of ids, processing them, and then reaching back into the table. It will also avoid the need for an extra index.
Find min and max id values.
Pick a random id between min and max.
Scan forward, looking for the first row with col1...col5 matching the criteria.
Something like...
SELECT b.* -- should replace with actual list of columns
FROM
( SELECT id
FROM tbl
WHERE id >= ( SELECT MIN(id) +
( MAX(id) - MIN(id)
- 22 -- somewhat avoids running off end
) * RAND()
FROM tbl )
AND col1 = 0 ... -- your various criteria
ORDER BY id
LIMIT 1
) AS a
JOIN tbl AS b USING(id);
Pros/cons:
Probably faster than anything else you can devise.
If the RAND() hits too late in the table, it will return nothing. In this (rare) case, run the query again, but starting at 0.
Big gaps in id will lead to a bias in which id is returned. (The link above discusses some kludges to handle such.)

SQL performance consideration when using joins for DB design

I have a table
name: order
with
id, product_id, comment
now I want to add a state
new table: order_state
1 -> finished
2 -> started
etc
then add a field order_state_id in the table order
In what way do I have to worry about performance?
Does this always perform well, or in what case won't it? E.g., when doing joins etc. with a lot of orders, say 200,000 orders?
I have used MySQL views before and they were horrible; the view I created obviously contained several joins. Is this not a related problem?
Not an answer, just too big for a comment
In addition to what has been said, consider partial indexes.
Some DBs, like Postgres and SQL Server, allow you to create indexes that specify not only columns but also rows.
It seems that you will end up with a constantly growing number of orders with order_state_id equal to finished (1) and a stable number of orders with order_state_id equal to started (2).
If your business makes use of queries like this:
SELECT id, comment
FROM order
WHERE order_state_id = 2
AND product_id = #some_value
Partial indexing allows you to limit the index, including only the unfinished orders
CREATE INDEX Started_Orders
ON order(product_id)
WHERE order_state_id = 2
This index will be smaller than its unfiltered counterpart.
Don't normalize order_state. Instead add this column
order_state ENUM('finished', 'started') NOT NULL
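As DDL (note that order is a reserved word in MySQL, so the table name needs backticks):
ALTER TABLE `order` ADD COLUMN order_state ENUM('finished', 'started') NOT NULL ;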
Then use it this way (for example):
SELECT ...
WHERE order_state = 'finished'
...
An ENUM (with up to 255 options) takes only 1 byte. INT takes 4 bytes. TINYINT takes 1 byte.
Back to your question... There are good uses of JOIN and there are unnecessary uses.

Very slow SQL ORDER BY, and EXPLAIN doesn't explain

The query:
SELECT
m.*,
mic.*
FROM
members m,
members_in_class_activities mic
WHERE
m.id = mic.member_id AND
mic.not_cancelled = '1' AND
mic.class_activity_id = '32' AND
mic.date = '2016-02-14' AND
mic.time = '11:00:00'
ORDER BY
mic.reservation_order
Table members has ~ 100k records, and table members_in_class_activities has about 300k. The result set is just 2 records. (Not 2k, just 2.)
All relevant columns are indexed (id and reservation_order are primary): member_id, not_cancelled, class_activity_id, date, time.
There is also a UNIQUE key for class_activity_id + member_id + date + time + not_cancelled. not_cancelled is NULL or '1'.
All other queries are very fast (1-5 ms), but this one is crazy slow: 600-1000 ms.
What doesn't help:
select only the 2 primary keys instead of * (0 % change)
a real JOIN instead of an implicit join (it actually seems slightly slower, but probably not) (0 % change)
removing the join to members entirely makes it slightly faster (15 % change)
What does help, immensely:
remove the ORDER BY on the primary key (99% change)
I have only 2 questions:
What????
How do I still ORDER BY x, but make it fast too? Do I really need a separate column?
I'm running 10.1.9-MariaDB on my dev machine, but it's slow on production's MySQL 5.5.27-log too.
Don't use ORDER BY in your main query. Try this:
SELECT * FROM (
    ... your query
) AS t ORDER BY t.reservation_order
As you mentioned, members_in_class_activities has about 300k records, so your ORDER BY applies to all of them, which definitely slows down your query.
class_activity_id + member_id + date + time + not_cancelled -- not optimal.
Start with the '=' fields in the WHERE in any order, then add on the ORDER BY field:
INDEX(class_activity_id, date, time, not_cancelled,
reservation_order)
Since you seem to need the UNIQUE constraint, then it would be almost as good to shuffle your index by putting member_id at the end (where it will be 'out of the way', but not used):
UNIQUE(class_activity_id, date, time, not_cancelled, member_id)
In general, it is bad to split a date and time apart; suggest a DATETIME column for the pair.
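For example, a sketch of combining the pair into one column (the new column name is assumed):
ALTER TABLE members_in_class_activities ADD COLUMN starts_at DATETIME ;
UPDATE members_in_class_activities SET starts_at = TIMESTAMP(`date`, `time`) ;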
Cookbook on creating indexes.

MySQL: Large data slow reads

I have a very large table with 17,044,833 rows, 6.4 GB in size. I am running the simple query below and it takes about 5 seconds. Any ideas what optimizations I can do to improve the speed of this query?
SELECT
`stat_date`,
SUM(`adserver_impr`),
SUM(`adserver_clicks`)
FROM `dfp_stats` WHERE
`stat_date` >= '2014-02-01'
AND
`stat_date` <= '2014-02-28'
MySQL Config:
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
innodb_buffer_pool_size = 10G
Server:
Memory: 48GB
Disk: 480GB
UPDATE
ORIGINAL QUERY:
EXPLAIN
SELECT
DS.`stat_date` 'DATE',
DC.`name` COUNTRY,
DA.`name` ADVERTISER,
DOX.`id` ORDID,
DOX.`name` ORDNAME,
DLI.`id` LIID,
DLI.`name` LINAME,
DLI.`is_ron` ISRON,
DOX.`is_direct` ISDIRECT,
DSZ.`size` LISIZE,
PUBSITE.`id` SITEID,
SUM(DS.`adserver_impr`) 'DFPIMPR',
SUM(DS.`adserver_clicks`) 'DFPCLCKS',
SUM(DS.`adserver_rev`) 'DFPREV'
FROM `dfp_stats` DS
LEFT JOIN `dfp_adunit1` AD1 ON AD1.`id` = DS.`dfp_adunit1_id`
LEFT JOIN `dfp_adunit2` AD2 ON AD2.`id` = DS.`dfp_adunit2_id`
LEFT JOIN `dfp_adunit3` AD3 ON AD3.`id` = DS.`dfp_adunit3_id`
LEFT JOIN `dfp_orders` DOX ON DOX.`id` = DS.`dfp_order_id`
LEFT JOIN `dfp_advertisers` DA ON DA.`id` = DOX.`dfp_advertiser_id`
LEFT JOIN `dfp_lineitems` DLI ON DLI.`id` = DS.`dfp_lineitem_id`
LEFT JOIN `dfp_countries` DC ON DC.`id` = DS.`dfp_country_id`
LEFT JOIN `dfp_creativesize` DSZ ON DSZ.`id` = DS.`dfp_creativesize_id`
LEFT JOIN `pubsites` PUBSITE
ON AD1.`pubsite_id` = PUBSITE.`id`
OR AD2.`pubsite_id` = PUBSITE.`id`
WHERE
DS.`stat_date` >= '2014-02-01'
AND DS.`stat_date` <= '2014-02-28'
AND PUBSITE.`id` = 6
GROUP BY DLI.`id`,DS.`stat_date`;
RESULTS OF EXPLAIN: (This is after adding the COVERING INDEX)
http://i.stack.imgur.com/vhVeB.png
If you haven't, you might want to index the stat_date field for faster lookups. Here's the syntax:
ALTER TABLE TABLE_NAME ADD INDEX (COLUMN_NAME);
Read more about indexing and optimizations here: https://dev.mysql.com/doc/refman/5.5/en/optimization-indexes.html
For best performance of this query, create a covering index:
... ON `dfp_stats` (`stat_date`,`adserver_impr`,`adserver_clicks`)
The output from EXPLAIN should show "Using index". This means that the query can be satisfied entirely from the index, without needing to visit any pages in the underlying table. (The term "covering index" refers to an index that includes all of the columns referenced by a query.)
At a minimum, you'll want an index with a leading column of stat_date so that the query can use an index range scan operation. An index range scan can essentially skip over boatloads of rows, and more quickly locate the rows that actually need to be checked.
As far as changes to the configuration of the MySQL instance, that really depends on whether the table is InnoDB or MyISAM.
FOLLOWUP
For InnoDB, memory is still king. If there's memory available on the server, then you can increase innodb_buffer_pool_size.
Also consider enabling the MySQL query cache. (We have the query cache enabled only for queries that are specifically enabled to use the cache with the SQL_CACHE keyword, i.e. SELECT SQL_CACHE t.foo, ..., so we don't clutter up the cache with queries that don't give us benefit. For other queries, we avoid running the extra code (that would otherwise be required) to search the cache and maintain the cache contents.)
Where we do get a benefit from the query cache is with "expensive" queries (which look at a lot of rows and do a lot of joins) against tables that are relatively static and that return small resultsets. (I'd consider a query that gets a single row of SUMs from a whole boatload of rows to be a good candidate for the query cache, if the table is infrequently updated, or if the same query is going to be run several times before a DML operation on the table invalidates the cache.)
It's a bit odd that your query is returning a non-aggregate that isn't in a GROUP BY clause.
If your query is using an index on stat_date, it's likely the query is returning the lowest value of stat_date within the range specified by the predicate; so it's likely that you would get an equivalent result using SELECT MIN(stat_date) AS stat_date.
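Making that explicit, the original query with the same date range would become:
SELECT
MIN(`stat_date`) AS stat_date,
SUM(`adserver_impr`),
SUM(`adserver_clicks`)
FROM `dfp_stats`
WHERE `stat_date` >= '2014-02-01'
AND `stat_date` <= '2014-02-28'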
A more complicated approach would be to setup a "summary" table, and refresh that periodically with the results from a query, and then have the application query the summary table. (A data warehouse type approach.) This doesn't work if you need "up-to-the-minute" accuracy. To get that, you'd likely need to introduce triggers on the target table, to maintain the summary table on INSERT, UPDATE and DELETE operations.
If I went that route, I'd probably opt for storing a summary row for each stat_date, so it could accommodate queries on any range or set of dates...
CREATE TABLE dfp_stats_summary
( stat_date DATE NOT NULL PRIMARY KEY
, adserver_impr BIGINT
, adserver_clicks BIGINT
) ENGINE=InnoDB ;
-- refresh
INSERT INTO dfp_stats_summary (stat_date, adserver_impr, adserver_clicks)
SELECT t.stat_date
, SUM(t.adserver_impr) AS adserver_impr
, SUM(t.adserver_clicks) AS adserver_clicks
FROM dfp_stats t
GROUP BY t.stat_date
ON DUPLICATE KEY
UPDATE adserver_impr = VALUES(adserver_impr)
, adserver_clicks = VALUES(adserver_clicks)
;
The refresh query will take a while to crank through everything; you might want to specify a date range in a WHERE clause to do a month or two at a time, and loop through all the possible months.
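For instance, restricting the refresh to February 2014 means adding this between the FROM and GROUP BY lines above:
WHERE t.stat_date >= '2014-02-01'
AND t.stat_date < '2014-03-01'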
With the summary table populated, just change the original query to reference the new summary table rather than the detail table. It would be a lot faster to add up 28 summary rows than several hundred thousand detail rows.
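And if "up-to-the-minute" accuracy were needed, the INSERT-side trigger mentioned earlier might look like this sketch (the trigger name is a placeholder; the UPDATE and DELETE triggers would follow the same pattern, subtracting the OLD values):
CREATE TRIGGER dfp_stats_ai AFTER INSERT ON dfp_stats
FOR EACH ROW
INSERT INTO dfp_stats_summary (stat_date, adserver_impr, adserver_clicks)
VALUES (NEW.stat_date, NEW.adserver_impr, NEW.adserver_clicks)
ON DUPLICATE KEY UPDATE
adserver_impr = adserver_impr + NEW.adserver_impr
, adserver_clicks = adserver_clicks + NEW.adserver_clicks ;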