Are these MySQL table index changes appropriate?

Because it takes forever to make index changes on my 40 million row table, I was hoping to get some feedback to make sure I do it right the first time.
Right now my "favorites" table has 3 indexes:
Primary auto-increment index on (id)
item_idx (item_id) - the id of the item that was favorited
faver_id_idx (faver_profile_id, id) - for displaying favorites from a particular user starting with the most recent.
To check whether the user has "faved" a particular item, I use this query:
SELECT id FROM favorites
WHERE item_id = '.mysql_real_escape_string($item_id).'
AND faver_profile_id = '.mysql_real_escape_string($user['id']).'
AND removed = 0
Which is doing an intersect:
Using intersect(item_idx,faver_id_idx)
This seems pretty inefficient to me, so I'm considering the following index setup:
Primary auto-increment index on (id)
item_faver_idx (item_id, removed, faver_profile_id)
faver_id_idx (faver_profile_id, removed, id)
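For reference, the whole change can be written as a single ALTER TABLE (a sketch; batching the drops and adds into one statement lets MySQL do all the index work in one pass, which matters on a 40 million row table):
ALTER TABLE favorites
  DROP INDEX item_idx,
  DROP INDEX faver_id_idx,
  ADD INDEX item_faver_idx (item_id, removed, faver_profile_id),
  ADD INDEX faver_id_idx (faver_profile_id, removed, id);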
The benefits I see are:
I can check if a user has faved an item without doing an intersect or table sort.
The "removed" (tinyint) column is now part of the index.
Questions I have:
In the (item_id, removed, faver_profile_id) index is there any reason to have faver_profile_id come first instead? For instance, if I'm doing the following query:
SELECT items.*, users.*, favorites.item_id
FROM items
LEFT JOIN users ON (items.submitter_id = users.id)
LEFT JOIN favorites ON (items.id = favorites.item_id AND favorites.faver_profile_id = 56 AND favorites.removed = 0)
ORDER BY items.id desc LIMIT 26
Would it be better to have faver_profile_id come first in the index so that it can just jump to the right faver_profile_id section of the index instead of having to check multiple item_id sections, and then scanning for the faver_profile_id within each of those sections?
Does it make sense to have "removed" in the index if only 1-3% of rows have a removed value of 1? Basically, is a slightly more efficient table scan worth the extra index size?
Anything I'm overlooking?

Related

How can I combine these two tables so that I can sort with information on each table, but not get duplicate answers?

I have two tables. The first is named master_list; it has these fields: master_id, item_id, name, img, item_code, and length. My second table is named types_join; it has these fields: master_id and type_id. (There is a third table, but it is not used in the queries; it is more for reference.) I need to combine these two tables so that I can sift the results down to only certain ones, but part of the information to sift on is in one table and the other part is in the other, and I don't want duplicate answers.
For example say I only want items that have a type_id of 3 and a length of 18.
When I use
SELECT * FROM master_list LEFT JOIN types_join ON master_list.master_id = types_join.master_id WHERE types_join.type_id = 3 AND master_list.length = 18
it finds the same thing twice.
How can I query this so I won't get duplicate answers?
Samples from my tables and the results I am getting (including what I get with an INNER JOIN) were posted as screenshots.
BTW, master_id and name both only have unique information on the master_list table. However, the types_join table does use the master_id multiple times later on, but not for Lye. That is why I know it is duplicating information.
If you want unique rows from master_list, use exists:
SELECT ml.*
FROM master_list ml
WHERE ml.length = 18 AND
      EXISTS (SELECT 1
              FROM types_join tj
              WHERE ml.master_id = tj.master_id AND tj.type_id = 3
             );
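For comparison, the same filter can be written as an explicit join plus DISTINCT (a sketch, not from the original answer); the EXISTS form is usually preferable because it can stop probing types_join at the first match:
SELECT DISTINCT ml.*
FROM master_list ml
JOIN types_join tj ON tj.master_id = ml.master_id
WHERE ml.length = 18 AND tj.type_id = 3;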
Any duplicates you get will be duplicates in master_list. If you want to remove them, you need to provide more information -- I would recommend a new question.
Thank you for the data. But as you can see, there is nothing wrong with your query.
Have you tried creating a unique index on master_id, just to make sure that you do not have duplicated rows?
CREATE UNIQUE INDEX MyMasterUnique
ON master_list(master_id);

Efficient indexes for lookup table

I'm trying to understand the proper way to assign indexes on a lookup table. Given the following tables and sample query, what are the most efficient primary/additional indexes for the lookup table?
Table: items (id, title, etc.)
Table: categories (id, title, etc.)
Table: lookup (category_id, item_id, type, etc.)
SELECT * FROM items
INNER JOIN lookup
    ON lookup.item_id = items.id AND lookup.type = "items"
INNER JOIN categories
    ON categories.id = lookup.category_id;
For this query:
SELECT *
FROM items i JOIN
     lookup l
     ON l.item_id = i.id AND l.type = 'items' JOIN
     categories c
     ON c.id = l.category_id;
The best indexes are probably:
lookup(type, item_id)
categories(id) (probably there already if id is a primary key)
items(id) (probably there already if id is a primary key)
Under some circumstances, this may not be a big improvement, particularly if most lookup() rows have a type of "items".
Apart from the join predicates, your query only has a single filtering predicate (lookup.type = "items"). If this predicate has good selectivity (i.e. it selects 5% or less of the rows) then you should use it as the first column of the index. I would do:
create index ix1 on lookup (type, item_id, category_id)
If the id columns on the table items and categories represent the primary keys, then there's nothing else to do.
The engine will probably read the lookup table using the index, and then will read the other two tables using their PK indexes.
Do not have an auto_incr id for the mapping table.
Have
PRIMARY KEY(type, item_id, category_id),
INDEX(category_id, type, item_id)
For the second index, will you need type when going from a category to an item? If not, leave it out.
More: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
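Putting that together, the mapping table could be declared like this (a sketch; column types are assumptions, and any extra columns from the question are omitted):
CREATE TABLE lookup (
  category_id INT NOT NULL,
  item_id INT NOT NULL,
  type VARCHAR(16) NOT NULL,
  PRIMARY KEY (type, item_id, category_id),
  INDEX (category_id, type, item_id)
) ENGINE=InnoDB;
With InnoDB, the composite PRIMARY KEY also clusters the rows by type, so lookups by type and item_id read adjacent rows.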

Enhancing performance of SQL query

I am running a query on three tables messages, message_recipients and users.
Table structure of messages table:
id int pk
message_id int
message text
user_id int
...
Index for this table is on user_id, message_id and id.
Table structure of message_recipients table:
id int pk
message_id int
read_date datetime
user_id int
...
Index is on id, message_id and user_id.
Table structure of users table:
id int pk
display_name varchar
...
Index is on id.
I am running the following query against these tables:
SELECT m.*,
       IF(m.user_id = 0, 'Campus Manager', u.display_name) AS name,
       mr.read_date,
       IF(m1.message_id > 0 AND m1.user_id = 1, TRUE, FALSE) AS replied
FROM messages m
JOIN message_recipients mr ON mr.message_id = m.id
LEFT JOIN users u ON u.UID = m.user_id
LEFT JOIN messages m1 ON m1.message_id = m.id
WHERE mr.user_id = 1
  AND m.published = 1
GROUP BY mr.message_id
ORDER BY m.created DESC
EXPLAIN returns the following data for this query (posted as a screenshot):
UPDATE: As suggested by @e4c5, I added a new composite index on (published, user_id, created); the new EXPLAIN output was posted as another screenshot.
How can this query be optimized by adding the required indexes (if any)? It is taking a lot of time.
GROUP BY needs to list all the non-aggregated columns. I suspect that would be a mess. Why do you need GROUP BY at all?
Why are you linking messages.id to message_id? Is this a hierarchical table, but the column names aren't like 'parent_id'?
"Index is on id, message_id and user_id" -- is that one composite index or 3 single-column indexes? (It makes a big difference.) It would be better to show us SHOW CREATE TABLE instead of ambiguously paraphrasing.
Is user_id=1 prolific? That is, are you expecting thousands of rows? Is this query only a problem for him?
Using LEFT JOIN implies that m1.message_id could be NULL, yet the reference to it seems to ignore that possibility.
If this is a single table that contains a message thread -- both the main info about the thread and the individual responses -- then I suggest it is a bad design. (I made this mistake once upon a time.) I think it is better to have a table with one row per thread and another table with one row per comment: 1 thread : many comments. So there would be a thread_id in the comment table.
I was able to bring the query time down from 3 seconds to 0.1 second by adding a new index to the messages and message_recipients tables and changing the storage engine of the messages table from InnoDB to MyISAM.
A composite index named composite was added to the messages table on these columns, in this order: published, user_id, created.
A composite index named message_id_2 was added to the message_recipients table on two columns: message_id, user_id.
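As DDL, the two additions look like this (a sketch using the index names from the answer):
ALTER TABLE messages
  ADD INDEX composite (published, user_id, created);
ALTER TABLE message_recipients
  ADD INDEX message_id_2 (message_id, user_id);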

Optimal MySQL table schema for given use case

I have two tables - books and images. The books table has many columns - including id (primary key), name (which is not unique), releasedate, etc. The images table has two columns - id (which is not unique, i.e. one book id may have multiple images associated with it, and we need all those images; this column has a non-unique index), and poster (which is the unique primary key; all images lie in the same bucket, hence cannot have duplicate names). My requirement: given a book name, find all images associated with it, along with the year of release and the bucketname for each image (the bucketname being just a number in this case).
I am running this query:
select books.id,poster,bucketname,year(releasedate) from books
inner join images where images.bookId = books.id and books.name = "<name>";
A sample result set may look like this:
As you can see there are two results matching - one with id 2 and year 1989, having 5 images, other one with id 261009, year 2013 and one image.
The problem is, the query is extremely slow. It takes around .14 seconds from MySQL console itself, under zero load (in production there may be several concurrent requests and they may be queued, leading to further delay), which is unacceptable for autocomplete. Can anyone tell me how to optimize the query by adding correct indices/keys to the tables? If it is not possible from MySQL, suggestions regarding a proper Redis schema would be useful as well.
Edit: Approx no. of rows in images - 480k, in books - 285k. In future, autocomplete will show result for book authors as well as book names, hence the query will need to expand to take into account a separate table authors where each author will have an id and name, just like a book.
For optimal performance, you want suitable covering indexes available. For example:
... on `books` (`name`,`id`,`releasedate`)
... on `images` (`bookid`,`poster`,`bucketname`)
We want name as the leading column in the index, because of the equality predicate in the WHERE clause. We want id and releasedate also included in the index to make it a "covering index", so the query can be satisfied from the index, without a need to visit pages of the underlying table to retrieve values.
We want bookid as the leading column because of the reference in the ON clause. Again, having poster and bucketname available right in the index make it a "covering" index.
Use EXPLAIN to see the query execution plan.
Also, note that the inner join operation won't return a row from books if a matching row in images is not found. If we want to return a row from books even when no image is available, we could use an outer join.
I'd write the query like this:
SELECT b.id
     , i.poster
     , i.bucketname
     , YEAR(b.releasedate)
  FROM books b
  LEFT
  JOIN images i
    ON i.bookid = b.id
 WHERE b.name = ?
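As DDL, the two covering indexes could look like this (a sketch; the index names are invented):
CREATE INDEX ix_books_name ON books (name, id, releasedate);
CREATE INDEX ix_images_bookid ON images (bookId, poster, bucketname);
After creating them, EXPLAIN on the query above should show "Using index" in the Extra column for both tables, confirming that no underlying table pages need to be visited:
EXPLAIN
SELECT b.id, i.poster, i.bucketname, YEAR(b.releasedate)
FROM books b
LEFT JOIN images i ON i.bookId = b.id
WHERE b.name = 'some book name';  -- literal placeholder for the ? parameter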

Database better inner join for performance in this case

I have 2 tables that manage the time spent on doing various things:
#times(id, time_in_minutes)
#times_intervals(id, times_id, time_in_minutes, start, end)
Then the #times might relate to different things:
#tasks(id, description)
#products(id, description, serial_number, year)
What is the best practice in order to reuse the same #times and #times_intervals for #tasks and #products?
I would think about:
#times(+task_id, +product_id)
// add task_id and product_id to the original #times table
But if I do so, joining the #times table to #tasks and #products would be slower, since each join has to choose between the two columns (task_id or product_id): when task_id is not null, join on #tasks, and vice versa.
(I'm using MySQL6)
Thanks a lot
I would drop the time_in_minutes column from the times table. This information is redundant if it is just the sum of the detail and is a premature optimization.
I would add a product_time table containing (product_id, times_id) and a task_time table containing (task_id, times_id).
Then to get the total time with a product:
SELECT *
FROM product p
INNER JOIN product_time pt
    ON pt.product_id = p.id
INNER JOIN (
    SELECT times_id, SUM(time_in_minutes) AS time_in_minutes
    FROM times_intervals
    GROUP BY times_id
) AS t
    ON t.times_id = pt.times_id
Typically, to make this perform, you would have a non-clustered covering index on times_intervals with columns times_id and time_in_minutes. Note that the times table is simply a data-less header table at this point: its only purpose is to group the times_intervals rows, and it is only necessary because you have this very similar arrangement for tasks.
If there were not two (or more) entities using the times_intervals, you might simply put product_id in the times_intervals and treat it as your header/master id.
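A sketch of the suggested junction tables and covering index (column types and the index name are assumptions):
CREATE TABLE product_time (
  product_id INT NOT NULL,
  times_id INT NOT NULL,
  PRIMARY KEY (product_id, times_id)
);
CREATE TABLE task_time (
  task_id INT NOT NULL,
  times_id INT NOT NULL,
  PRIMARY KEY (task_id, times_id)
);
-- covering index so the GROUP BY subquery is satisfied from the index alone
CREATE INDEX ix_times_minutes ON times_intervals (times_id, time_in_minutes);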
I would suggest against adding an id column to times for every table you might join it to. It would break normalization and make joins much more complicated.
If you only have one time (or time interval) for a task or a product, make a column in that table that references the times table. Otherwise, you could make a separate table like
#multitimes(multi_id, time_id)
where the two columns together are a primary key, and then have products and tasks reference multi_id. Then each record in each of those tables can be related to any number of times without any conflicts.
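A sketch of that arrangement (column types assumed; products and tasks would each carry a nullable multi_id column referencing this table):
CREATE TABLE multitimes (
  multi_id INT NOT NULL,
  time_id INT NOT NULL,
  PRIMARY KEY (multi_id, time_id)
);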