I'm having a hard time figuring out how to query/index a database.
The situation is pretty simple. Each time a user visits a category, his/her visit date is stored. My goal is to list the categories in which elements have been added after the user's latest visit.
Here are the two tables:
CREATE TABLE `elements` (
`category_id` int(11) NOT NULL,
`element_id` int(11) NOT NULL,
`title` varchar(255) NOT NULL,
`added_date` datetime NOT NULL,
PRIMARY KEY (`category_id`,`element_id`),
KEY `index_element_id` (`element_id`)
)
CREATE TABLE `categories_views` (
`member_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
`view_date` datetime NOT NULL,
PRIMARY KEY (`member_id`,`category_id`),
KEY `index_element_id` (`category_id`)
)
Query:
SELECT
categories_views.*,
elements.category_id
FROM
elements
INNER JOIN categories_views ON (categories_views.category_id = elements.category_id)
WHERE
categories_views.member_id = 1
AND elements.added_date > categories_views.view_date
GROUP BY elements.category_id
Explained:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: elements
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 89057
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories_views
type: eq_ref
possible_keys: PRIMARY,index_element_id
key: PRIMARY
key_len: 8
ref: const,convert.elements.category_id
rows: 1
Extra: Using where
With about 100k rows in each table, the query is taking around 0.3s, which is too long for something that should be executed for every user action in a web context.
If possible, what indexes should I add, or how should I rewrite this query in order to avoid using filesorts and temporary tables?
If each member has a relatively low number of category_views, I suggest testing a different query:
SELECT v.*
FROM categories_views v
WHERE v.member_id = 1
AND EXISTS
( SELECT 1
FROM elements e
WHERE e.category_id = v.category_id
AND e.added_date > v.view_date
)
For optimum performance of that query, you'd want to ensure you had indexes:
... ON elements (category_id, added_date)
... ON categories_views (member_id, category_id)
NOTE: It looks like the primary key on the categories_views table may be (member_id, category_id), which means an appropriate index already exists.
I'm assuming (as best I can figure out from the original query) that the categories_views table contains only the "latest" view of each category for a user, i.e. that (member_id, category_id) is unique. That has to be the case if the original query returns a correct result set, i.e. only categories that have "new" elements added since the user's last view of that category; otherwise, any "older" view_date value in categories_views would trigger the inclusion of the category, even if a newer view_date were later than the latest added_date of any element in that category.
If that's not the case, i.e. (member_id,category_id) is not unique, then the query would need to be changed.
For the original query, add a covering index on the elements table:
... ON elements (category_id, added_date)
The goal there is to get the EXPLAIN output to show "Using index".
You might also try adding an index:
... ON categories_views (member_id, category_id, view_date)
To get all the columns from the categories_views table (for the select list), the query is going to have to visit the pages in the table (unless there's an index that contains all of those columns). The goal would be to reduce the number of rows that need to be visited on data pages to find the row, by having all (or most) of the predicates satisfied from the index.
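For concreteness, the shorthand above corresponds to statements along these lines (the index names here are arbitrary placeholders):

```sql
-- Covering index for the elements side of the join
ALTER TABLE elements
  ADD INDEX ix_elements_cat_date (category_id, added_date);

-- Widened lookup index on categories_views, ending with view_date
ALTER TABLE categories_views
  ADD INDEX ix_views_member_cat_date (member_id, category_id, view_date);
```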
Is it necessary to return the category_id column from the elements table? Don't we already know that this is the same value as in the category_id column from the categories_views table, due to the inner join predicate?
Related
This is my table:
CREATE TABLE `e_relationship` (
`OID` int(11) NOT NULL AUTO_INCREMENT,
`E_E_OID` int(11) NOT NULL,
`E_E_OID2` int(11) NOT NULL,
`REL_DISPLAY` text NOT NULL,
`APP_OID` int(11) NOT NULL,
`META_OID` int(11) NOT NULL,
`STORE_DATE` datetime NOT NULL,
`UID` int(11) DEFAULT NULL,
PRIMARY KEY (`OID`),
KEY `Left_Entity` (`E_E_OID`),
KEY `Right_Entity` (`E_E_OID2`),
KEY `Meta_Left` (`META_OID`,`E_E_OID`),
KEY `Meta_Right` (`META_OID`,`E_E_OID2`)
) ENGINE=InnoDB AUTO_INCREMENT=310169 DEFAULT CHARSET=utf8;
The following query takes about 2.5-3ms, the result set is 1,290 rows, and the total number of rows in the table is 1,008,700:
SELECT * FROM e_relationship WHERE e_e_oid=#value1 OR e_e_oid2=#value1
This is result of EXPLAIN:
id: 1
select_type: SIMPLE
table: e_relationship
type: index_merge
possible_keys: Left_Entity,Right_Entity
key: Left_Entity,Right_Entity
key_len: 4,4
ref: NULL
rows: 1290
Extra: Using union(Left_Entity,Right_Entity); Using where
I would like to speed up this query, as it is quite critical in my system. I'm not sure whether I'm hitting some sort of bottleneck in MySQL now that the number of records has passed one million, and I'd like to know about other possible strategies to improve performance.
Sometimes MySQL has trouble optimizing OR queries. In this case, you can split it up into two queries using UNION:
SELECT * FROM e_relationship WHERE e_e_oid = #value1
UNION
SELECT * FROM e_relationship WHERE e_e_oid2 = #value1
Each subquery will make use of the appropriate index, and then the results will be merged.
However, in simple cases MySQL can automatically perform this transformation, and it's doing so in your query. That's what Using union in the EXPLAIN output means.
Have a table of E_E_OIDs; it would have (up to) 2 rows per e_relationship row. Then:
SELECT ...
FROM eeoids AS ee
JOIN e_relationship AS er ON er.OID = ee.oid
WHERE ee.eeoid = #value1
And change from 2 columns to one. And get rid of 2 of the indexes.
This achieves the effect of the OR, but even faster than UNION.
(Remaining details left to reader.)
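A sketch of what that lookup table might look like (the table and column names here are my own guesses at the details left to the reader):

```sql
-- Hypothetical lookup table: one row per (entity value, relationship) pair.
-- Populate it with both E_E_OID and E_E_OID2 for every e_relationship row.
CREATE TABLE eeoids (
  eeoid INT NOT NULL,            -- an E_E_OID or E_E_OID2 value
  oid   INT NOT NULL,            -- the e_relationship.OID it appears in
  PRIMARY KEY (eeoid, oid)
) ENGINE=InnoDB;

-- A single index range scan on eeoid replaces the OR / index_merge union:
SELECT er.*
FROM eeoids AS ee
JOIN e_relationship AS er ON er.OID = ee.oid
WHERE ee.eeoid = 42;             -- the value previously tested against both columns
```

Rows in eeoids would have to be maintained alongside e_relationship (in application code or with triggers).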
Analogy
Suppose a person has several phone numbers (landline, cell, work, etc.), and I want to look up their record based on a single phone number. So, I build a table of phone:personID pairs and index it on phone. Then I do
SELECT ...
FROM Phones AS nums
JOIN Persons AS x ON x.personID = nums.personID
WHERE nums.phone = '123-555-1212'
Note: There is no phone number in the Persons table.
Your E_E_OID is like my phone.
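In DDL form, the analogy might look like this (hypothetical names; only the columns needed for the lookup are shown):

```sql
CREATE TABLE Persons (
  personID INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (personID)
  -- note: no phone number columns here
) ENGINE=InnoDB;

CREATE TABLE Phones (
  phone VARCHAR(20) NOT NULL,
  personID INT NOT NULL,
  PRIMARY KEY (phone, personID)  -- the lookup by phone uses this index
) ENGINE=InnoDB;
```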
Table structure, indexes, and query are below. On a table with more than a million records, this takes well over a minute to run. I guess mainly because of the GROUP BY and/or DAY().
I tried creating a composite index with the draft column first, because that would allow faster querying of WHERE draft = 0. Unfortunately it doesn't seem to make a difference and I haven't been able to find much information at all on how to use indexes to optimise this kind of query with a GROUP BY.
How can I speed this up?
CREATE TABLE `table` (
`id` bigint(20) UNSIGNED NOT NULL,
`user_id` int(11) NOT NULL,
`coords` point NOT NULL,
`date` datetime NOT NULL,
`draft` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `table`
ADD PRIMARY KEY (`id`),
ADD KEY `user_id` (`user_id`),
ADD KEY `draft` (`draft`),
ADD KEY `date` (`date`),
ADD KEY `user_id_2` (`draft`,`user_id`,`date`) USING BTREE;
SELECT id, user_id, date, coords
FROM table
WHERE draft = 0
GROUP BY user_id, DAY(date)
ORDER BY date ASC
EXPLAIN
select_type: SIMPLE
table: table
type: ref
possible_keys: draft,user_id_2
key: draft
key_len: 1
ref: const
rows: 3427592
Extra: Using where; Using temporary; Using filesort
First of all, I don't see any reason why you are even using GROUP BY, given that your select clause does not include any aggregates. Perhaps this is what you are intending to run:
SELECT id, user_id, date, coords
FROM yourTable
WHERE draft = 0
ORDER BY date;
This query might benefit from the following index:
CREATE INDEX idx ON yourTable (draft, date);
If used, this index would let MySQL discard some records via the WHERE clause, and also would enable efficient sorting in the ORDER BY date clause. You could also go all out, and cover the select clause:
CREATE INDEX idx ON yourTable (draft, date, user_id);
This would cover the entire select clause, meaning that the index by itself would contain all information needed to complete the query (this assumes you are running InnoDB, in which case MySQL would automatically include id as the fourth column of the index).
First, sorry if my terms are not right; I'm not a MySQL professional.
I have a table like this:
CREATE TABLE `accesses` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`time` int(11) DEFAULT NULL,
`accessed_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_accesses_on_accessed_at` (`accessed_at`)
) ENGINE=InnoDB AUTO_INCREMENT=9278483 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
This table has 10,000,000 rows in it. I use it to generate charts, with queries like this:
SELECT SUM(time) AS value, DATE(accessed_at) AS date
FROM `accesses`
GROUP BY date;
This query is very slow (more than 1 minute). I'm also running lots of other queries (with AVG, MIN, or MAX instead of SUM, with a WHERE on a specific day or month, with GROUP BY HOUR(accessed_at), etc.).
I want to optimize it.
The best idea I have is to add several columns with redundant data, like DATE(accessed_at), HOUR(accessed_at), and MONTH(accessed_at), then add an index on each.
... Is this solution good, or is there a better one?
Regards
Yes, it can be an optimization to store data redundantly in permanent columns with an index to optimize certain queries. This is one example of denormalization.
Depending on the amount of data and the frequency of queries, this can be an important speedup (@Marshall Tigerus downplays it too much, IMHO).
I tested this out by running EXPLAIN:
mysql> explain SELECT SUM(time) AS value, DATE(accessed_at) AS date FROM `accesses` GROUP BY date\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: accesses
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: Using temporary; Using filesort
Ignore the fact that the table is empty in my test. The important part is Using temporary; Using filesort which are expensive operations, especially if your temp table gets so large that MySQL can't fit it in memory.
I added some columns and indexes on them:
mysql> alter table accesses add column cdate date, add key (cdate),
add column chour tinyint, add key (chour),
add column cmonth tinyint, add key (cmonth);
mysql> explain SELECT SUM(time) AS value, cdate FROM `accesses` GROUP BY cdate\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: accesses
partitions: NULL
type: index
possible_keys: cdate
key: cdate
key_len: 4
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
The temporary table and filesort went away, because MySQL knows it can do an index scan to process the rows in the correct order.
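One detail the ALTER TABLE above leaves out: the new columns start out NULL, so they need a one-time backfill, and they must then be kept in sync on every insert (in application code or with a trigger). A sketch, assuming the column names used above:

```sql
-- One-time backfill of the denormalized columns
UPDATE accesses
SET cdate  = DATE(accessed_at),
    chour  = HOUR(accessed_at),
    cmonth = MONTH(accessed_at);
```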
I have two tables, app and pricehistory.
There is a primary index id on app, which is an int.
On pricehistory I have three fields, id_app (int), price (float), and dateup (date), with a unique index on (id_app, dateup).
I'm trying to get the latest (by date) price of an app:
select app.id,
( select price
from pricehistory
where id_app=app.id
order by dateup desc limit 1)
from app
where id=147
The EXPLAIN output is kind of weird, because it returns 1 row but it still does a filesort:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY app const PRIMARY PRIMARY 4 const 1
2 DEPENDENT SUBQUERY pricehistory ref id_app,id_app_2,id_app_3 id_app 4 const 1 Using where; Using filesort
Why does it need to filesort when there is only 1 row? And why is it filesorting when I've indexed everything it needs (id_app and dateup)?
app has 1 million rows and I'm using InnoDB.
edit: an SQL fiddle demonstrating the problem:
http://sqlfiddle.com/#!2/085027/1
edit3: a new fiddle with another query showing the same problem:
http://sqlfiddle.com/#!2/f7682/6
edit4: this fiddle ( http://sqlfiddle.com/#!2/2785c/2 ) shows that the proposed query doesn't work, because it selects all the data from pricehistory just to fetch the rows I want.
Here's a quick rule of thumb for which order columns should go in an index:
1. Columns referenced in the WHERE clause with an equality condition (=).
2. Choose one of:
a. Columns referenced in the ORDER BY clause.
b. Columns referenced in a GROUP BY clause.
c. Columns referenced in the WHERE clause with a range condition (!=, >, <, IN, BETWEEN, IS [NOT] NULL).
3. Columns referenced in the SELECT-list.
See How to Design Indexes, Really.
In this case, I was able to remove the filesort with this index:
mysql> alter table pricehistory add key bk1 (id_app, dateup, price_fr);
And here's the EXPLAIN, showing no filesort, and the improvement of "Using index":
mysql> explain select price_fr from pricehistory where id_app=1 order by dateup desc\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: pricehistory
type: ref
possible_keys: bk1
key: bk1
key_len: 4
ref: const
rows: 1
Extra: Using where; Using index
You can make this index UNIQUE if you want to.
I had to drop the other unique keys, to avoid confusing the optimizer.
The two UNIQUE KEYs are causing the problem. I changed your fiddle to the following, and it works without a filesort:
CREATE TABLE IF NOT EXISTS `pricehistory` (
`id_app` int(10) unsigned NOT NULL,
`dateup` date NOT NULL,
`price_fr` float NOT NULL DEFAULT '-1',
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`),
UNIQUE KEY `id_app` (`id_app`,`dateup`,`price_fr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=345 ;
INSERT INTO pricehistory
(id_app, price_fr,dateup)
VALUES
('1', '4.99', now()),
('2', '0.45', now());
The EXPLAIN gives:
ID SELECT_TYPE TABLE TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS EXTRA
1 SIMPLE pricehistory ref id_app id_app 4 const 1 Using where; Using index
There's no reason to use a UNIQUE KEY on both (id_app,dateup) and (id_app,price_fr,dateup), as they are redundant. I'm pretty confident that redundancy is making MySQL somehow uncertain of itself so that it errs on the side of doing a filesort.
The solution is to remove the UNIQUE from one of the indexes. It seems that if it's not needed, it's better not to use the UNIQUE keyword.
thanks to both of you
edit:
damn, with a different query involving 2 tables, the filesort is back:
http://sqlfiddle.com/#!2/f7682/6
I'm trying to optimize a MySQL table for faster reads. The ratio of read to writes is about 100:1 so I'm disposed to sacrifice write performances with multi indexes.
Relevant fields for my table are the following and it contains about 200000 records
CREATE TABLE `publications` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
-- omitted fields
`publication_date` date NOT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
`position` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
-- these are just attempts, they are not production index
KEY `publication_date` (`publication_date`),
KEY `publication_date_2` (`publication_date`,`position`,`active`)
) ENGINE=MyISAM;
Since I'm using Ruby on Rails to access data in this table I've defined a default scope for this table which is
default_scope where(:active => true).order('publication_date DESC, position ASC')
i.e. every query on this table will automatically be completed with the following SQL fragment, so you can assume that almost all queries will include these conditions:
WHERE `publications`.`active` = 1 ORDER BY publication_date DESC, position
So I'm mainly interested in optimizing this kind of query, plus queries with publication_date in the WHERE condition.
I tried the following indexes in various combinations (also with several of them at the same time):
`publication_date`
`publication_date`,`position`
`publication_date`,`position`,`active`
However, a simple query such as this one still doesn't use the index properly and falls back to a filesort:
SELECT `publications`.* FROM `publications`
WHERE `publications`.`active` = 1
AND (id NOT IN (35217,35216,35215,35218))
ORDER BY publication_date DESC, position
LIMIT 8 OFFSET 0
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: publications
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 34903
Extra: Using where; Using filesort
1 row in set (0.00 sec)
Some considerations on my issue:
According to the MySQL documentation, a composite index can't be used for ordering when you mix ASC and DESC in the ORDER BY clause. (MySQL 8.0 later added descending indexes, which lift this restriction.)
active is a boolean flag, so putting it in a standalone index makes no sense (it has just 2 possible values), but it's always used in the WHERE clause, so it should appear somewhere in an index to avoid Using where in Extra.
position is an integer with few possible values, and it's always used scoped to publication_date, so I think it's useless to have it in a standalone index.
Lots of queries use publication_date in the WHERE clause, so it could be useful to have it in a standalone index too, even though that's redundant given it's the first column of the composite index.
One problem is that you are mixing sort orders in the ORDER BY clause. You could invert your position (inverted_position = max_position - position) so that you can also invert the sort order on that column.
You can then create a compound index on [publication_date, inverted_position] and change your order by clause to publication_date DESC, inverted_position DESC.
The active column should most likely not be part of the index as it has a very low selectivity.
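A sketch of that suggestion (the column and index names are mine, and max_position is an assumed bound, 100 here):

```sql
-- Add the inverted column and backfill it (assuming max_position = 100)
ALTER TABLE publications ADD COLUMN inverted_position INT;
UPDATE publications SET inverted_position = 100 - position;

-- Compound index matching the now single-direction sort
ALTER TABLE publications
  ADD INDEX ix_pub_date_invpos (publication_date, inverted_position);

-- Rewritten query: both sort keys descend, so the index can serve the ORDER BY
SELECT *
FROM publications
WHERE active = 1
ORDER BY publication_date DESC, inverted_position DESC
LIMIT 8;
```

inverted_position has to be kept in sync whenever position changes, e.g. in application code.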