Optimizing a simple query on two large tables - mysql

I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most viewed page id's by a person's friend.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightning fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in Ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is MySQL doing?

As far as I know, the best way to make a query like that "lightning fast", is to create a summary table that tracks friend page views per page per creator.
You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.
If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
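As a rough sketch of what that summary-table approach could look like (the table, trigger, index, and column names below are hypothetical, and it assumes a row in views means "this user has viewed this page", matching the semantics of the original query):
-- Hypothetical summary table: one row per (creator, page), counting how many
-- of that creator's friends have viewed the page.
CREATE TABLE friend_page_views (
  creator_id INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (creator_id, page_id),
  KEY idx_creator_count (creator_id, view_count)
);

-- Hypothetical trigger: when a user views a page for the first time, bump the
-- counter for every creator who has that user as a friend (receiver).
DELIMITER //
CREATE TRIGGER views_after_insert
AFTER INSERT ON views
FOR EACH ROW
BEGIN
  INSERT INTO friend_page_views (creator_id, page_id, view_count)
    SELECT f.creator_id, NEW.page_id, 1
    FROM friendships f
    WHERE f.receiver_id = NEW.user_id
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END//
DELIMITER ;

-- The per-request query then becomes a simple indexed lookup, no GROUP BY.
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;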

You should absolutely look into denormalizing this table. If you create a separate table that maintains the user ids and the exact counts for every page they viewed, your query should become a lot simpler.
You can easily maintain this table by using a trigger on your views table, that does updates to the 'views_summary' table whenever an insert happens on the 'views' table.
You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top X pages per person.
Hope this helps,
Evert

Your indexes look correct, although if friendships has very wide rows, you might want the index on (creator_id, receiver_id) to avoid reading all of them.
However something's not right here, why are you doing a filesort for 271 rows?
Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.
sort_buffer_size should also have a sane value.
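For reference, a sketch of those two suggestions (the index name and the variable values shown are only illustrative starting points, to be tuned for your workload):
-- Covering secondary index: the creator_id lookup then also supplies
-- receiver_id straight from the index.
CREATE INDEX index_friendships_on_creator_and_receiver
  ON friendships (creator_id, receiver_id);

-- Give in-memory temporary tables and sorting some headroom.
-- GLOBAL values apply to new connections.
SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
SET GLOBAL sort_buffer_size    = 4 * 1024 * 1024;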

Related

SQL table performance - more or fewer tables?

While creating a notification system I ran across a question. The community the system is created for is rather big, and I have 2 ideas for my SQL tables:
Make one table which includes:
comments table:
id(AUTO_INCREMENT) | comment(text) | viewers_id(int) | date(datetime)
In this option, the comments are stored with a date and all users that viewed the comment, separated with ",". For example:
1| Hi I'm a penguin|1,2,3,4|24.06.1879
The system should now use the column viewers_id to decide if it should show a notification or not.
Make two tables like:
comments table:
id(AUTO_INCREMENT) | comment(text) | date(datetime)
viewer table:
id(AUTO_INCREMENT) | comment_id | viewers_id(int)
example:
5|I'm a rock|23.08.1778
1|5|1,2,3,4
In this example we check the viewers_id again.
Which of these is likely to have better performance?
In my opinion you shouldn't focus that much on optimizing your tables, since it's far more rewarding to optimize your application first.
Now to your question:
Increasing the performance of an SQL table can be achieved in two ways:
1. Normalize: as with every SQL table, I would recommend normalizing it (see Wikipedia: Database normalization).
2. Reduce concurrency: that means reducing how often data can't be accessed because it is being changed.
As for your example: if I had to pick one of those, I would pick the second option.
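If it helps, here is a minimal sketch of the second option in normalized form (names are illustrative), with one row per viewer instead of a comma-separated list:
CREATE TABLE comments (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  comment TEXT,
  date    DATETIME
);

-- One row per (comment, viewer) pair instead of a "1,2,3,4" string.
CREATE TABLE comment_viewers (
  comment_id INT NOT NULL,
  viewer_id  INT NOT NULL,
  PRIMARY KEY (comment_id, viewer_id)
);

-- "Has user 3 already seen comment 5?" becomes an indexed lookup.
SELECT 1 FROM comment_viewers WHERE comment_id = 5 AND viewer_id = 3;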

what is the best way to add multiple indexes to polymorphic table

Let's say I have a polymorphic table similar to this:
| document_id | owner_type | owner_id |
| 1 | Client | 1 |
| 1 | Client | 2 |
| 2 | User | 1 |
I know I'll be calling queries looking for owner_type and owner_type + owner_id
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client'
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client' AND owner_id = 1
Let's ignore how to index document_id; I would like to know the best way (performance-wise) to index the owner columns for these SQL scenarios.
Solution 1:
CREATE INDEX do_type_id_ix ON document_ownerships (owner_type, owner_id)
this way I would have just one index that works for both scenarios
Solution 2:
CREATE INDEX do_id_type_ix ON document_ownerships (owner_id, owner_type)
CREATE INDEX do_type_ix ON document_ownerships (owner_type)
This way I would have indexes that exactly match how I will use the database. The only downside is that I have two indexes where I could have just one.
Solution 3:
CREATE INDEX do_id_ix ON document_ownerships (owner_id)
CREATE INDEX do_type_ix ON document_ownerships (owner_type)
individual column indexes
From what I was exploring in the MySQL console with EXPLAIN I get really similar results, and because it's a new project I don't have enough data to explore this properly and be 100% sure (even when I populated the database with several hundred records). So can anyone give me a piece of advice from their experience?
This is going to depend a lot on the distribution of your data - indexes only make sense if there is good selectivity in the indexed columns.
e.g. if there are only 2 possible values for owner_type, viz Client and User, and assuming they are distributed evenly, then any index only on owner_type will be pointless. In this case, a query like
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client';
would likely return a large percentage of the records in the table, and a scan is the best that is possible (Although I'm assuming your real queries will join to the derived tables and filter on derived table-specific columns, which would be a very different query plan to this one.)
Thus I would consider indexing:
1. Only on owner_id, assuming this gives a good degree of selectivity by itself,
2. Or, on the combination (owner_id, owner_type), only if there is evidence that index #1 isn't selective enough, AND if the combination of the two fields gives sufficient selectivity to warrant the index.
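A sketch of that advice in DDL form (index names are illustrative, table name as in the question's solutions):
-- 1. owner_id alone, if it is selective enough by itself.
CREATE INDEX do_owner_id_ix ON document_ownerships (owner_id);

-- 2. Otherwise, the combination, leading with the more selective column.
CREATE INDEX do_owner_id_type_ix ON document_ownerships (owner_id, owner_type);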

MySQL query painfully slow on large data

I'm no MySQL whiz but I get it. I have just inherited a pretty large table (600,000 rows and around 90 columns (please kill me...)) and I have a smaller table that I've created to link it with a categories table.
I'm trying to query said table with a left join so I have both sets of data in one object but it runs terribly slow and I'm not hot enough to sort it out; I'd really appreciate a little guidance and explanation as to why it's so slow.
SELECT
`products`.`Product_number`,
`products`.`Price`,
`products`.`Previous_Price_1`,
`products`.`Previous_Price_2`,
`products`.`Product_number`,
`products`.`AverageOverallRating`,
`products`.`Name`,
`products`.`Brand_description`
FROM `product_categories`
LEFT OUTER JOIN `products`
ON `products`.`product_id`= `product_categories`.`product_id`
WHERE COALESCE(product_categories.cat4, product_categories.cat3,
product_categories.cat2, product_categories.cat1) = '123456'
AND `product_categories`.`product_id` != 0
The two tables are MyISAM; the products table has indexes on Product_number and Brand_description, and the product_categories table has a unique index on all columns combined, if this info is of any help at all.
Having inherited this system I need to get this working asap before I nuke it and do it properly so any help right now will earn you my utmost respect!
[Edit]
Here is the output of the explain extended:
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| 1 | SIMPLE | product_categories | index | NULL | cat1 | 23 | NULL | 1224419 | 100.00 | Using where; Using index |
| 1 | SIMPLE | products | ALL | Product_id | NULL | NULL | NULL | 512376 | 100.00 | |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
Optimize Table
To establish a baseline, I would first recommend running an OPTIMIZE TABLE command on both tables. Please note that this might take some time. From the docs:
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
[...]
For MyISAM tables, OPTIMIZE TABLE works as follows:
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
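For example, the baseline step could simply be:
-- May take a while on tables this size, and MyISAM tables are locked while it runs.
OPTIMIZE TABLE products, product_categories;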
Indexing
If space and index management isn't a concern, you can try adding a composite index on
product_categories.cat4, product_categories.cat3, product_categories.cat2, product_categories.cat1
This would be advised if you use a leftmost subset of these columns often in your queries. The query plan indicates that it can use the cat1 index of product_categories. This most likely only includes the cat1 column. By adding all four category columns to an index, it can more efficiently seek to the desired row. From the docs:
MySQL can use multiple-column indexes for queries that test all the
columns in the index, or queries that test just the first column, the
first two columns, the first three columns, and so on. If you specify
the columns in the right order in the index definition, a single
composite index can speed up several kinds of queries on the same
table.
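The composite index being suggested above would look something like this (the index name is illustrative):
ALTER TABLE product_categories
  ADD INDEX cat_composite_ix (cat4, cat3, cat2, cat1);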
Structure
Furthermore, given that your table has 90 columns you should also be aware that a wider table can lead to slower query performance. You may want to consider Vertically Partitioning your table into multiple tables:
Having too many columns can bloat your record size, which in turn
results in more memory blocks being read in and out of memory causing
higher I/O. This can hurt performance. One way to combat this is to
split your tables into smaller more independent tables with smaller
cardinalities than the original. This should now allow for a better
Blocking Factor (as defined above) which means less I/O and faster
performance. This process of breaking apart the table like this is a
called a Vertical Partition.
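As a rough illustration of that idea (the split and column choices below are purely hypothetical):
-- Keep the handful of hot, frequently queried columns in the main table...
CREATE TABLE products_core (
  product_id     INT PRIMARY KEY,
  Product_number VARCHAR(64),
  Name           VARCHAR(255),
  Price          DECIMAL(10,2)
);

-- ...and move the rarely read columns into a 1:1 side table keyed by the same id.
CREATE TABLE products_details (
  product_id INT PRIMARY KEY
  -- , ...the remaining ~80 wide columns go here
);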
The meaning of your query seems to be "find all products that have the category '123456'." Is that correct?
COALESCE is an extraordinarily expensive function to use in a WHERE statement, because it operates on index-hostile NULL values. Your explain result shows that your query is not being very selective on your product_categories table. In MySQL you need to avoid functions in WHERE statements altogether if you want to exploit indexes to make your queries fast.
The thing someone else said about 90-column tables being harmful is also true. But you're stuck with it, so let's just deal with it.
Can we rework your query to get rid of the function-based WHERE? Let's try this.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT DISTINCT product_id
FROM product_categories
WHERE product_id <> 0
AND ( cat1='123456'
OR cat2='123456'
OR cat3='123456'
OR cat4='123456')
)
For this to work fast you're going to need to create separate indexes on your four cat columns. The composite unique index ("on all columns combined") is not going to help you. It still may not be so good.
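Those separate indexes could be created along these lines (index names are illustrative; if the cat columns are long VARCHARs, a prefix length may be needed):
CREATE INDEX cat1_ix ON product_categories (cat1);
CREATE INDEX cat2_ix ON product_categories (cat2);
CREATE INDEX cat3_ix ON product_categories (cat3);
CREATE INDEX cat4_ix ON product_categories (cat4);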
A better solution might be FULLTEXT searching IN BOOLEAN MODE. You're working with the MyISAM access method so this is possible. It's definitely worth a try. It could be very fast indeed.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT product_id
FROM product_categories
WHERE MATCH(cat1,cat2,cat3,cat4)
AGAINST('123456' IN BOOLEAN MODE)
AND product_id <> 0
)
For this to work fast you're going to need to create a FULLTEXT index like so.
CREATE FULLTEXT INDEX cat_lookup
ON product_categories (cat1, cat2, cat3, cat4)
Note that neither of these suggested queries produces precisely the same results as your COALESCE query. The way your COALESCE query is set up, some combinations that will match these queries won't match it. For example:
cat1     cat2     cat3     cat4
123451   123453   123455   123456   matches your and my queries
123456   123455   123454   123452   matches my queries but not yours
But it's likely that my queries will produce a useful list of products, even if they include a few more items than yours.
You can debug this stuff by just working with the inner queries on product_categories.
There is something strange. Does the table product_categories indeed have a product_id column? Shouldn't the from and where clauses be like this:
FROM `product_categories` pc
LEFT OUTER JOIN `products` p ON p.category_id = pc.id
WHERE
COALESCE(pc.cat4, pc.cat3, pc.cat2, pc.cat1) = '123456'
AND pc.id != 0

Performance Penalties for Unused Joins

I'm writing a script that generates a report based on a query that uses several tables joined together. One of the inputs to the script is going to be a list of the fields that are required on the report. Depending on the fields requested, some of the tables might not be needed. My question is: is there a [significant] performance penalty for including a join if it is not referenced in a SELECT or WHERE clause?
Consider the following tables:
mysql> SELECT * FROM `Books`;
+----------------------+----------+
| title | authorId |
+----------------------+----------+
| Animal Farm | 3 |
| Brave New World | 2 |
| Fahrenheit 451 | 1 |
| Nineteen Eighty-Four | 3 |
+----------------------+----------+
mysql> SELECT * FROM `Authors`;
+----+----------+-----------+
| id | lastName | firstName |
+----+----------+-----------+
| 1 | Bradbury | Ray |
| 2 | Huxley | Aldous |
| 3 | Orwell | George |
+----+----------+-----------+
Does
SELECT
`Authors`.`lastName`
FROM
`Authors`
WHERE
`Authors`.`id` = 1
Outperform:
SELECT
`Authors`.`lastName`
FROM
`Authors`
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
WHERE
`Authors`.`id` = 1
?
It seems to me that MySQL should just know to ignore the JOIN completely, since the table is not referenced in the SELECT or WHERE clause. But somehow I doubt this is the case. Of course, this is a really basic example. The actual data involved will be much more complex.
And really, it's not a terribly huge deal... I just need to know if my script needs to be "smart" about the joins, and only include them if the fields requested will rely on them.
This isn't actually unused since it means that only Authors that exist in Books are included in the result set.
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
However, if you "knew" that every Author existed in Books, then there would be some performance benefit in removing the join, but it would largely depend on indexes, the number of records in the table, and the logic in the join (especially when doing data conversions).
This is the kind of question that is impossible to answer. Yes, adding the join will take additional time; it's impossible to tell whether you'd be able to measure that time without, well, uh....measuring the time.
Broadly speaking, if - like in your example - you're joining on primary keys, with unique indices, it's unlikely to make a measurable difference.
If you've got more complex joins (which you hint at), or are joining on fields without an index, or if your join involves a function, the performance penalty may be significant.
Of course, it may still be easier to do it this way than to write multiple queries which are essentially the same, other than removing unneeded joins.
Final bit of advice - try abstracting the queries into views. That way, you can optimize performance once, and perhaps write your report queries in a simpler way...
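For instance, a hypothetical view over the sample tables above:
CREATE VIEW author_books AS
SELECT
  Authors.id AS authorId,
  Authors.lastName,
  Authors.firstName,
  Books.title
FROM Authors
JOIN Books ON Authors.id = Books.authorId;

-- Report queries then just pick the columns they need from the view.
SELECT lastName FROM author_books WHERE authorId = 1;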
Joins will always take time.
Side effects
On top of that, an inner join (which is the default join) influences the result by limiting the number of rows you get.
So depending on whether all authors are in books the two queries may or may not be identical.
Also, if an author has written more than one book, the result set of the 'joined' query will show duplicate results.
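Using the sample data above this is easy to see: George Orwell (id 3) has two books, so the joined query returns his name twice.
SELECT Authors.lastName
FROM Authors
JOIN Books ON Authors.id = Books.authorId
WHERE Authors.id = 3;
-- returns 'Orwell' twice: once for Animal Farm, once for Nineteen Eighty-Four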
Performance
In the WHERE clause you have stated Authors.id to be the constant 1, therefore (provided you have indexes on Authors.id and Books.authorId) it will be a very fast lookup in both tables. The query time between the two queries will be very close.
In general joins can take quite a lot of time though and with all the added side effects should only be undertaken if you really want to use the extra info the join offers.
It seems that there are two things you are trying to determine: whether there are any optimizations that can be done between the two SELECT statements, and which of the two would be the fastest to execute.
It seems that since the join really is limiting the returned results to authors who have books in the list, there cannot be much optimization done.
It also seems that, for the case you were describing where the joined table really has no limiting effect on the returned results, the query without the join would perform faster.

Really slow query used to be really fast. Explain shows rows=1 on local backup but rows=2287359 on server

I check for spam every now and then using "select * from posts where post like '%http://%' order by id desc limit 10" and searching a few other keywords. Lately the select is impossibly slow.
mysql> explain select * from posts where reply like "%http://%" order by id desc limit 1;
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | posts | index | NULL | PRIMARY | 4 | NULL | 2287347 | Using where |
+----+-------------+-----------+-------+---------------+---------+---------+------+---------+-------------+
1 row in set (0.00 sec)
On my netbook with 1 GB of RAM the only difference is that it shows the "rows" column as 1. There are only 1.3 million posts on my netbook. The server has about 6 GB of RAM and a fast processor. What should I optimize so it's not horribly slow? Recently I added an index to search by userId, which I'm not sure was a smart choice, but I added it to both the backup and production servers a little before this issue started happening. I'd imagine it's related to it not being able to sort in RAM due to a missed tweak?
It also seems to be slow when I do stuff like "delete from posts where threadId=X", dunno if related.
With respect to
SELECT * FROM posts WHERE reply LIKE "%http://%" ORDER BY id DESC LIMIT 1
Due to the wildcards on both sides of the http://, MySQL cannot use an index on reply to quickly find what you're looking for. Moreover, since you're asking for the one with the largest id, MySQL has to pull all matching results to make sure that you have the one with the largest id.
Depending on how much of the data in the posts table is made up of the reply column, it might be worthwhile to add a compound index on (id, reply), and change the query to something like
SELECT id FROM posts WHERE reply LIKE "%http://%" ORDER BY id DESC LIMIT 1
(which can have index-only execution), then join to the posts table or retrieve the posts using the retrieved ids. If the query has index-only execution, and the index fits in memory and is already in memory (due to normal use or by intentionally warming it up), you could potentially speed up the query execution.
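A sketch of that compound index (this assumes reply is a VARCHAR that can be indexed in full; a TEXT column would need a prefix length, which would rule out index-only execution):
ALTER TABLE posts ADD INDEX id_reply_ix (id, reply);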
Having said all that, if identical queries on two identical servers with identical data are giving different execution plans and execution times, it might be time to OPTIMIZE TABLE posts to refresh the index statistics and/or defragment the table. If you have recently been adding/removing indexes, things might have gone astray. Moreover, if the data is fragmented, then when it's pulling rows in PRIMARY KEY order, it could be jumping all over the disk to retrieve the data.
With respect to DELETE FROM posts WHERE threadId=X, it should be fine as long as there is an index on threadId.
Indexes won't be used if you start your search comparison with a "%". Your problem is with
where reply like "%http://%"
As confirmed by your explain, no indexes are used. The speed difference may be due to caching.
What kind of indexes do you have on your table(s)? A good rule of thumb is to have an index on the columns that appear most often in your WHERE clause. If you do not have an index on your threadId column, your last query will be a lot slower than if you did.
Your first query (select * from posts where post like '%http://%') will be slow simply due to the "like" in the query. I would suggest filtering your query with another WHERE clause - perhaps by date (which is hopefully indexed):
select * from posts where postdate > 'SOMEDATE' and post like '%http://%'
Can you write an after-insert trigger that examines the text looking for substring 'http://' and either flags the current record or writes out its id to a SPAM table? As #brent said, indexes are not used for "contains substring" searches.
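A sketch of that idea (the spam_candidates table, the trigger name, and the reply column are assumptions):
CREATE TABLE spam_candidates (post_id INT PRIMARY KEY);

DELIMITER //
CREATE TRIGGER posts_spam_flag
AFTER INSERT ON posts
FOR EACH ROW
BEGIN
  -- The substring check runs once per insert, so reads never pay for it.
  IF NEW.reply LIKE '%http://%' THEN
    INSERT IGNORE INTO spam_candidates (post_id) VALUES (NEW.id);
  END IF;
END//
DELIMITER ;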