Proper database indexes with subquery - mysql

THE INFO
Currently I have two tables I am working with: a POSTS table that holds data for individual posts, and a FAVORITES table that holds data for users who opt to save favorite posts in their profile.
The tables look like this:
On the POSTS table there is only a primary key on id and no other indexes. On FAVORITES I have a composite index on (postid, deviceid) that I was testing.
The POSTS table contains approx. 10,000 entries.
The FAVORITES table contains approx. 4,680,500 entries.
The query I use to grab the favorites from a particular deviceid is:
SELECT post FROM POSTS
WHERE id IN
(SELECT postid FROM favourites WHERE deviceid="12d4a4a4a4a4a4a");
THE PROBLEM:
With the amount of data being returned, and several devices having multiple favorites, the query can take upwards of 7-10 seconds to both COUNT favorites for a particular device and/or SELECT using the above query and subquery. When this happens during peak times, you can obviously imagine the issues that can cause.
Caching the query results is an option, but since the data is pretty specific in that the same user is not calling the query multiple times, but rather unique instances, I think there is a better solution. On another note, caching would need to be short lived, which would nullify its benefit.
I know the method of indexing, and I am familiar with foreign keys, but I'm not sure practically if and how they could be implemented between the query and the subquery to enhance performance.
Any advice/guidance is much appreciated.
Cheers,
Jared

SELECT post FROM POSTS
INNER JOIN favourites ON POSTS.id=favourites.postid
WHERE favourites.deviceid="12d4a4a4a4a4a4a";
Split the index on favourites into two indexes, one on deviceid and one on postid.

Why use a subquery? Have you tried a join?
SELECT post FROM posts INNER JOIN favourites ON posts.id=favourites.postid WHERE deviceid="12d4a4a4a4a4a4a"
You won't be using (only) your indices to retrieve the query results since the post field is not in any index. So you might actually end up saving time by making one query to get all the matching IDs from posts, then a second to get the post values.
Using EXPLAIN SELECT... will also help you optimize this query. Have you tried that?

On MySQL, a composite index can only be used starting from its leftmost column. So with an index on (postid, deviceid), you can only use the index when you have a postid and need the deviceid. In your query you're doing the opposite: you have a constant deviceid and want the corresponding postid values. So your query cannot use the index to look up rows by deviceid.
More information on mysql composite indexes.
You should either add a deviceid index or reverse the index so that it's (deviceid, postid).
By the way, your favorites table looks a lot like a junction table. Consider whether you need the id column at all.
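The index change suggested above could look roughly like the following. This is a sketch only: the index name idx_device_post is made up for illustration, while the table and column names are taken from the question. Running EXPLAIN before and after the change should show whether MySQL switches to a keyed lookup on favourites.

```sql
-- Hypothetical index name; table and column names come from the question.
-- deviceid goes first so that lookups by deviceid can use the index.
ALTER TABLE favourites
  ADD INDEX idx_device_post (deviceid, postid);

-- Check the plan: with the new index in place, the favourites row of the
-- EXPLAIN output would be expected to list idx_device_post under "key".
EXPLAIN
SELECT p.post
FROM POSTS AS p
INNER JOIN favourites AS f ON f.postid = p.id
WHERE f.deviceid = '12d4a4a4a4a4a4a';
```

Once the new index is confirmed to be in use, the original (postid, deviceid) index can be reviewed and dropped if no other query needs it.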

A couple of things you could do to improve performance:
Separate the device_id out to a device table with a surrogate primary key (an int) and a non-clustered index on the device_id varchar. The favorites table should only include the device table surrogate key. This should make the favorites table smaller and should make your favorites table index smaller. The smaller the index and smaller the table, the faster it will be to search.
Your favorites table index is wrong. It should not be (post_id,device_id). It should be (device_id,post_id) as your query needs to search by device_id first. As your favorites table row is so small, I question the value of including the post_id in the index. It just isn't worth the extra space for a possible marginal improvement in query speed.
EDIT: You need the post_id in the index to keep the entries unique (just make sure device_id is first).
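A possible shape for the device table described above is sketched here. Every table name, column type and length is an assumption made for illustration, and the sketch also assumes that each device_id string is unique.

```sql
-- All names, types and lengths below are assumptions for illustration.
CREATE TABLE devices (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate key
  device_id VARCHAR(64)  NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_devices_device_id (device_id)       -- secondary (non-clustered) index
) ENGINE=InnoDB;

CREATE TABLE device_favorites (
  device_key INT UNSIGNED NOT NULL,   -- refers to devices.id
  post_id    INT UNSIGNED NOT NULL,   -- refers to the id of the posts table
  PRIMARY KEY (device_key, post_id),  -- device first; also keeps entries unique
  CONSTRAINT fk_fav_device FOREIGN KEY (device_key) REFERENCES devices (id)
) ENGINE=InnoDB;

-- Fetch the favorite posts for one device.
SELECT p.post
FROM devices AS d
INNER JOIN device_favorites AS f ON f.device_key = d.id
INNER JOIN POSTS AS p ON p.id = f.post_id
WHERE d.device_id = '12d4a4a4a4a4a4a';
```

The varchar is stored once per device instead of once per favorite, so both the favorites rows and their index entries shrink to two integers.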

Related

One composite index or many indexes for foreign keys?

What is the difference between creating a single composite index over all the foreign keys of a relation table and creating one index for each column (foreign key) of the relation table?
For instance, I have the table sales(p_id, e_id, c_id, ammount) where p_id is a foreign key (products table), e_id is a foreign key (employee table) and c_id a foreign key (customer_table). The primary key of the table is {p_id, e_id, c_id}.
Which one is better?
CREATE INDEX cmpindex ON sales(p_id, e_id, c_id)
OR
CREATE INDEX pindex on sales(p_id)
CREATE INDEX eindex on sales(e_id)
CREATE INDEX cindex on sales(c_id)
I mostly run queries with joins on the relation table and the parent tables.
Which one is better depends on your actual queries.
One thing to understand is that when you join the table sales once in your query, it will only use one index for it (at the most). So you need to make sure an index is available that is most appropriate for the query.
If you join the sales table always to all three other tables (customer, product and employee) then a composite index is to be preferred, assuming that the engine will actually use it and not perform a table scan.
The order of the fields in the composite index is important when it comes to the order of the results. For instance, if your query is going to group the results by product (first), and then order the details per customer, you could benefit from an index that has the product id first, and the customer id as second.
But it may also be that the engine decides that it is better to start scanning the table sales first and then join in the other three tables using their respective primary key indexes. In that case no index is used that exists on the sales table.
The only way to find out is to get the execution plan of your query and see which indexes will be used when they are defined.
If you only have one query on the sales table, there is no need to have several indexes. But more likely you have several queries which output completely different results, with different field selections, filters, groupings, ...Etc.
In that case you may need several indexes, some of which will serve for one type of query, and others for others. Note that what you propose is not mutually exclusive. You could maybe benefit from several composite indexes, which just have a different order of fields.
Obviously, a multitude of indexes will slow down data changes in those tables, so you need to consider that trade-off as well.
Note that an index on a compound key will only be used if you query on the first portion, the first and second portion, or the first, second and third portion, etc. So querying on p_id, or on p_id and e_id, or even on e_id and p_id (the order of the conditions does not matter) can use the index. Indeed, any query that filters on p_id can use this index.
However, if you query your sales table on e_id or c_id or any combination of these two, cmpindex will not be used and a full table scan will be performed.
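The prefix rule can be checked with EXPLAIN, as in the sketch below. The literal values are placeholders; sales and cmpindex come from the question. Because the primary key there covers the same columns in the same order, the plan may name PRIMARY rather than cmpindex.

```sql
-- Literal values are placeholders; sales and cmpindex come from the question.

-- Leftmost column present: these can use cmpindex (p_id, e_id, c_id).
EXPLAIN SELECT * FROM sales WHERE p_id = 1;
EXPLAIN SELECT * FROM sales WHERE e_id = 2 AND p_id = 1;  -- condition order does not matter

-- Leftmost column missing: no keyed lookup through cmpindex is possible.
EXPLAIN SELECT * FROM sales WHERE e_id = 2;
EXPLAIN SELECT * FROM sales WHERE e_id = 2 AND c_id = 3;
```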
One benefit of having an index on each foreign key (a non-unique index, as there could be multiple sales of the same product, or by the same employee, or to the same customer, leading to duplicate entries in the index) is that the query optimizer has the option of using the index to reduce the number of rows returned, and then doing a sequential search through the result set.
E.g. if the query is a search on sales of a particular product to a particular customer (regardless of employee) and you have a million sales, the foreign key index cindex could be used to return 20 sales items to that particular customer, and that result set could be very efficiently searched sequentially to find which of these sales were for a particular product.
If the search was performed on Product and pindex was used, the result set may be 10,000 rows (all sales of that product), which would have to be sequentially searched to find the sales of that product to a particular customer, leading to a very inefficient query.
I believe that the statistics kept for a table (used by the optimizer) keep track of the average number of rows that will be returned for a query using each index, so the optimizer will be able to work out that cindex should be used rather than pindex in the examples above. Alternatively, you can give hints on your queries to specify that a particular index be used.
It is, obviously, important to refresh these statistics on a regular basis (UPDATE STATISTICS in SQL Server; the MySQL counterpart is ANALYZE TABLE), as the execution plan would use pindex in the example above if the statistics showed, on average, only 10 sales of each product.
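In MySQL terms, refreshing the statistics and giving an index hint might look like this. The sketch assumes MySQL as the engine, and the literal values are placeholders.

```sql
-- Assumes MySQL; literal values are placeholders.
-- Refresh the index statistics that the optimizer relies on.
ANALYZE TABLE sales;

-- Index hint: ask the optimizer to consider only cindex for this table.
SELECT *
FROM sales USE INDEX (cindex)
WHERE c_id = 42 AND p_id = 7;
```

A hint pins the choice even when the data distribution changes later, so it is worth comparing the EXPLAIN output with and without it first.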
If your queries search sales through each of the parent tables independently, then you should create a separate index for each column.
If that's not necessary, then you can go for a composite index.
As HoneyBadger commented, you already have a composite index, since your primary key is itself an index.
In general, you should use a single index for each column whenever you think you will have queries involving each field by itself.
As stated here, when you have a composite index, it can work with queries involving all fields, or with queries involving the first field (in order), the first and the second, or the first, second and third together. It won't be used in queries involving only the second and third field.
The other answers are missing an important point. When you declare a foreign key in MySQL, it creates an index on the column. This is not (necessarily) true in other databases, but it is true in MySQL.
So, the declaration automatically creates these indexes:
CREATE INDEX pindex on sales(p_id);
CREATE INDEX eindex on sales(e_id);
CREATE INDEX cindex on sales(c_id);
(These indexes are very handy for dealing with cascading constraints and maintaining the data integrity based on the foreign key.)
If you happen to have also declared an index on sales(p_id, e_id, c_id, amount), then the first of the indexes is not needed -- it is a subset of this index. However, the other two are needed.
Is this index needed? As mentioned in other questions, that depends on the queries you want to use the index for. I recommend starting with the documentation on this subject to understand how the indexes get used.
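To see which indexes actually exist on the table, including any that MySQL created for the foreign keys, the following statements can be used:

```sql
-- Lists every index on the table, including any created implicitly
-- for foreign key columns.
SHOW INDEX FROM sales;

-- Shows the full table definition, with keys and constraints.
SHOW CREATE TABLE sales;
```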

Should I use multiple index method if indexed fields are also foreign keys?

After I added foreign keys, MySQL forced indexes onto the key columns, which were already indexed with a multiple-column index. I use InnoDB.
Here is a structure of my table:
id, company_id, student_id ...
company_id and student_id had been indexed using:
ALTER TABLE `table` ADD INDEX `multiple_index` (`company_id`,`student_id`)
Why do I use a multiple-column index? Because most of the time my query is:
SELECT * FROM `table` WHERE company_id = 1 AND student_id = 3
Sometimes I just fetch rows by student_id:
SELECT * FROM `table` WHERE student_id = 3
After I added foreign keys for company_id and student_id, MySQL indexed both of them separately. So now I have both the multiple-column index and separate single-column indexes.
My question is: should I drop the multiple-column index?
It depends. If the same student belongs to many companies, no, don't drop it. When querying for company_id = 1 AND student_id = 3, the optimizer has to pick one index, and after that, it will either have to check multiple students or multiple companies.
My gut tells me this won't be the case, though; students won't be associated with more than ~10 companies, so scanning over the index a bit won't be a big deal. That said, this is a lot more brittle than having the index on both columns. With the two-column index, the optimizer knows what the right thing to do is. When it has two single-column indexes to pick from, it might not, now or in the future, so you should use FORCE INDEX to make sure it uses the student_id index.
The other thing to consider is how this table is used. If it's rarely written to but read frequently, there's not much of a penalty beyond space for the extra index.
TL;DR: the index is not redundant. Whether or not you should keep it is complicated.
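If the multiple-column index is dropped, the FORCE INDEX suggestion could be written as below. The index name idx_student is hypothetical; the real name of the index that MySQL created for the foreign key can be looked up with SHOW INDEX.

```sql
-- idx_student is a hypothetical name; find the real one with:
SHOW INDEX FROM `table`;

-- Force the lookup through the student_id index, then filter by company_id.
EXPLAIN
SELECT *
FROM `table` FORCE INDEX (idx_student)
WHERE company_id = 1 AND student_id = 3;
```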

Question on how to improve my database structure and what fields should I index?

I am creating a simple comparison script and I have some questions about the database structure. Firstly, the database will be huge; I am expecting more than 1 million entries in products.
Secondly, there will be a search form whose search term will be matched against the name field (%$term%), displaying the product's related info and the shop's info.
Below you can see the structure of my table, named products.
id int(10) NOT NULL
name varchar(50) NOT NULL
link varchar(50) NOT NULL
description varchar(50) NOT NULL
image varchar(50) NOT NULL
price varchar(50) NOT NULL
My questions are:
1) Do you suggest that I index a field? Users will not be able to insert or update products; the only query will be a SELECT to display the results, and I will update the products from XML feeds often to pick up possible product changes.
2) I have to store the shop info like name, shipping, link, image... This gives me two options: a) create a new table named shops and join the two tables through a new field shopID in products that refers to the id in shops, or b) add this info (name, shipping, ...) as extra fields in products for every single product. (I think the answer is obvious, but I need your suggestion.)
3) Are there any other things I should keep in mind, or change?
I am not an advanced programmer and what I learn is through the internet, so maybe the questions are too obvious for you, but for me this is the ticket to learning.
Thank you for your answers.
Indexes are required to fetch records very fast, so yes, they're recommended. But what kind of index would you like to use? The MyISAM engine offers a "regular" string index that you can use with a LIKE clause (e.g. LIKE 'hello%'), but the index cannot be used when there is a wildcard at the beginning of the search phrase. In addition, MyISAM has a FULLTEXT index that allows you to search for words in the whole string, not just at the beginning of the string. So you could create a FULLTEXT index on the columns description and name, but 2 FULLTEXT indexes seem redundant in this case. Maybe you could join those columns and separate the values with a token or a character? If so, you'll need to create only 1 FULLTEXT index on the joined column, which can save a lot of fragmentation and disk space. One of the cons of using the MyISAM engine is that when writing to it (UPDATE/DELETE queries), it locks the entire table. So, if the table is written to many times a minute, it will probably make other queries hang. That's why you should see if the InnoDB engine suits your needs, since it enables concurrent read/write operations on the table.
That's probably a good idea, since having an index on the column price seems essential, and FULLTEXT indexes don't work together with other indexes.
I'd say: Use InnoDB and Sphinx, and have a primary index on id & a regular index on price.
The most important thing for you to understand is that when writing code for specific software, you must be well familiar with that software and its caveats. You should read High Performance MySQL; it is highly recommended.
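For the FULLTEXT option discussed above, a minimal sketch follows. Instead of physically joining the two columns, it uses a single FULLTEXT index that spans both, which MySQL supports directly. The index name and the search term are placeholders.

```sql
-- ft_name_description is a hypothetical index name; 'term' is a placeholder.
-- InnoDB supports FULLTEXT indexes as of MySQL 5.6; older versions need MyISAM.
ALTER TABLE products
  ADD FULLTEXT INDEX ft_name_description (name, description);

-- The column list in MATCH() has to match the indexed columns.
SELECT id, name, price
FROM products
WHERE MATCH(name, description) AGAINST ('term' IN NATURAL LANGUAGE MODE);
```

Note that FULLTEXT search matches whole words rather than arbitrary substrings, so it is not an exact replacement for LIKE '%term%'.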
Edit:
If you want to add indexes to the products table, you can do that with ALTER TABLE /* etc */ when the table is empty or contains a small amount of data. If the table has a lot of data, then it's recommended to create another table like products, alter that new table, and populate it with data from the old products table, e.g.:
CREATE TABLE `products_new` LIKE `products`;
ALTER TABLE `products_new` ADD FULLTEXT (`name`);
LOCK TABLES `products` READ, `products_new` WRITE;
INSERT INTO `products_new` SELECT * FROM `products`;
LOCK TABLES `products` WRITE, `products_new` WRITE;
ALTER TABLE `products` RENAME TO `products_bad`;
ALTER TABLE `products_new` RENAME TO `products`;
/* The following doesn't work:
RENAME TABLE `products` TO `products_bad`, `products_new` TO `products`;
See: http://bugs.mysql.com/bug.php?id=22246
*/
DROP TABLE `products_bad`;
Nikolai,
The ID should be a primary key. That automatically puts an index on ID, and will speed up any queries that need to get specific products.
The shop table should be a second table, but you should have a 3rd table that joins products with shops. At its most basic, it would have two fields, shop_id and product_id. This lets you have a single product in multiple shops. These two fields should be foreign keys to the product table and shop table.
If you are ever thinking about having a different price for a product per shop, then the product_store join table should also contain the price, although the base price could be stored in the products table.
Price should be a decimal, so that you can do calculations on the price field.
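The three-table layout described above might be sketched as follows. Table names, column types and lengths are assumptions, and the foreign key to products presumes that products.id has been declared as the primary key.

```sql
-- Names, types and lengths are assumptions for illustration.
-- Presumes products.id is an INT primary key, as recommended above.
CREATE TABLE shops (
  id   INT          NOT NULL AUTO_INCREMENT,
  name VARCHAR(50)  NOT NULL,
  link VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE product_shop (
  product_id INT           NOT NULL,
  shop_id    INT           NOT NULL,
  price      DECIMAL(10,2) NOT NULL,   -- per-shop price, usable in calculations
  PRIMARY KEY (product_id, shop_id),   -- one row per product and shop
  KEY idx_shop (shop_id),              -- lookups starting from the shop
  CONSTRAINT fk_ps_product FOREIGN KEY (product_id) REFERENCES products (id),
  CONSTRAINT fk_ps_shop    FOREIGN KEY (shop_id)    REFERENCES shops (id)
) ENGINE=InnoDB;
```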
1) You should generally index fields that are commonly used. However since your search on name uses a wildcard at the start an index will have no effect on this query.
2) Creating a shops table and linking to this would be better.
Index price for sure, because something tells me you will search and order by this field.
"Premature optimization is the root of all evil" (c) Donald Knuth. So I suggest normalizing your tables; so YES, create a table for shops. Once your application has grown big and you face high loads, you will be able to denormalize your database to avoid JOINs (one way to optimize a demanding application).
Then get back to Stack Overflow with your problem ;-)
Generally you should index fields that will be intensively used. But an index won't help much for a search that starts with a wildcard.
Better use another table with a foreign key.
Also, shouldn't the "id" field in your products table be defined as PRIMARY KEY?
Here are my suggestions:
To be able to search for %term% you need full-text search; an index will not do you any good when the search term starts with a wildcard.
Yes, you should put an index on the id column (and probably make it auto increment) since that seems to be the unique column in the table. Other than that, there's no point in us suggesting any other indexes since we don't know which queries you are going to run.
Yes, create another table for shops; otherwise you will have data that is not normalized, for shop name and so on (there might be rare cases that "require" de-normalization, such as optimization, but you have not reached that point yet). Data that is not normalized will cause problems. In your specific case, what will you do when a shop needs to change its name? Well, you will have to update all matching rows in the product table.
There are many things you should keep in mind, but that's out of scope for this answer. I suggest that you get to work and learn as you go, because learning by doing is a great way to become a better developer. Then when you hit a specific problem, search for it or post it here on Stack Overflow.

How to setup indexes on a two-column table for quick querying on both columns?

If I have a group members table with two columns in it: group_id and user_id, where users can be part of multiple groups and groups can contain many users, what would be the best way to setup the indexes? I want to be able to quickly determine which users are in a single group, so I think I would need to index on group_id, but I also want to quickly determine all the groups any single user is in, which makes me think I would also need to index on user_id.
Is it good to put a separate index on both group_id and user_id if those are the only columns in the table? Is there a better way to setup the indexes?
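This sounds like something that should be a composite primary key on (group_id, user_id), which indexes the pair and serves lookups by group_id. Note that lookups by user_id alone cannot start from the leftmost column of that key, so they need a second index that begins with user_id.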
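For the two lookups described in the question, one possible layout is a composite primary key plus a second index in the reverse column order. The table name, index name and column types here are assumptions.

```sql
-- group_members and idx_user_group are hypothetical names; types are assumed.
CREATE TABLE group_members (
  group_id INT UNSIGNED NOT NULL,
  user_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (group_id, user_id),         -- "which users are in this group?"
  KEY idx_user_group (user_id, group_id)   -- "which groups is this user in?"
) ENGINE=InnoDB;

SELECT user_id  FROM group_members WHERE group_id = 7;
SELECT group_id FROM group_members WHERE user_id = 42;
```

Each query can be answered from one index alone, and the primary key also prevents duplicate memberships.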

Should I avoid COUNT altogether in InnoDB?

Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, suppose I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyISAM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NOT NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table (assuming id is auto-incremented with few gaps from deleted rows). This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most cases that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.
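The two approaches for the badges example could be sketched like this. The users table's id column, the badge_id column and the literal values are assumptions; count_badges_awarded is the column name used above.

```sql
-- users.id, user_badges.badge_id and the literal values are assumptions.

-- Option 1: count on demand; an index on user_id keeps the scan short.
SELECT COUNT(*) FROM user_badges WHERE user_id = 111;

-- Option 2: maintain a counter column, updated in the same transaction
-- as the insert so that the two cannot drift apart.
START TRANSACTION;
INSERT INTO user_badges (user_id, badge_id) VALUES (111, 5);
UPDATE users
SET count_badges_awarded = count_badges_awarded + 1
WHERE id = 111;
COMMIT;
```

The counter trades a slightly heavier write for a cheaper read, which tends to pay off when the number is displayed far more often than it changes.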