many to many relation performance - mysql

I have three tables. My news have on or several categories.
News
-------------------------
| id | title | created
Category
-------------------------
| id | title
News_Category
-------------------------
| news_id | category_id
But i have many rows on News about 10,000,000 rows. Using joind for fetch data will be performance issue.
Select title from News_Category left join News on (News_Category.news_id = News.id)
group by News_Category.id order by News.created desc limit 10
I want to have best query for this issue. For many to many relation data in huge tables which query have better performance.
Please give me the best query for this use case.

The best performance for that query, is given by permanently store it. This is you need a materialized view.
On MySQL you can implement the materialized view by create a table.
this is
create table FooMaterializedView as
(select foo1.*, foo2.* from foo1 join foo2 on ( ... ) where ... order by ...);
and now depending on how often the source tables change (this is receive inserts, updates or deletes) and how much you need to use the latest version of the query you need to implement suitable view maintenance strategy.
This is, depending of your needs and the problem itself perform:
full computation (i.e. truncate the materialized view and generate it again from scratch) might be enough
incremental computation. If it is too costly to the system perform a full computation very often, you must capture only the changes on the source tables and update the materialized view according to the changes.
If you need to take the incremental approach, I can only wish you the best luck. I can point you that you can use triggers to capture the changes on the source tables, and you will need to either use an algorithmic or an equalization approach to compute the changes to make to the materialized view.

Related

SQL table performance - more or fewer tables?

While creating a notification system I ran across a question. The community the system is created for is rather big, and I have 2 ideas for my SQL tables:
Make one table which includes :
comments table:
id(AUTO_INCREMENT) | comment(text) | viewers_id(int) | date(datetime)
In this option, the comments are stored with a date and all users that viewed the comment divided with ",". For example:
1| Hi I'm a penguin|1,2,3,4|24.06.1879
The system should now use the column viewers_id to decide if it should show a notification or not.
make two tables like:
comments table:
id(AUTO_INCREMENT) | comment(text) | date(datetime)
viewer table:
id(AUTO_INCREMENT) | comment_id | viewers_id(int)
example:
5|I'm a rock|23.08.1778
1|5|1,2,3,4
In this example we check the viewers_id again.
Which of these is likely to have better performance?
In my opinion you shouldn't focus that much on optimizing your tables, since its far more rewarding to optimize your application first.
Now to your question:
Increasing the Performance of an SQL Table can be achieved in 2 ways:
1. Normalize as for every SQL Table i would recommend you to normalize it:
Wikipedia - normalization 2. you can reduce concurrency that means reducing the amount of times when data can't be accessed because it gets changed.
as for your example: if i had to pick one of those i would pick the second option.

Optimization for search

I'm working on a project and I have some problem with optimization in MySQL.
My main table looks like and have around 1M rows:
+----+------+---------+
| id | Name | city_id | City_id is between (0, 2000).
+----+------+---------+
I'll make many queries like:
SELECT * FROM table WHERE city_id=x
SELECT * FROM table WHERE city_id=x AND id=rand()
It is only to show you main operations on this database
If i'll make 2k small tables will it be good solution?
I think the solution you are looking for is an index. Try this:
create index idx_table_city_id on table(city_id, id);
SQL is designed to handle large tables. There are very few reasons why you would want to split up data from one table to multiple tables. The only good reason I can think of are when doing so is needed to meet security requirements.

MySQL query painfully slow on large data

I'm no MySQL whiz but I get it, I have just inherited a pretty large table (600,000 rows and around 90 columns (Please kill me...)) and I have a smaller table that I've created to link it with a categories table.
I'm trying to query said table with a left join so I have both sets of data in one object but it runs terribly slow and I'm not hot enough to sort it out; I'd really appreciate a little guidance and explanation as to why it's so slow.
SELECT
`products`.`Product_number`,
`products`.`Price`,
`products`.`Previous_Price_1`,
`products`.`Previous_Price_2`,
`products`.`Product_number`,
`products`.`AverageOverallRating`,
`products`.`Name`,
`products`.`Brand_description`
FROM `product_categories`
LEFT OUTER JOIN `products`
ON `products`.`product_id`= `product_categories`.`product_id`
WHERE COALESCE(product_categories.cat4, product_categories.cat3,
product_categories.cat2, product_categories.cat1) = '123456'
AND `product_categories`.`product_id` != 0
The two tables are MyISAM, the products table has indexing on Product_number and Brand_Description and the product_categories table has a unique index on all columns combined; if this info is of any help at all.
Having inherited this system I need to get this working asap before I nuke it and do it properly so any help right now will earn you my utmost respect!
[Edit]
Here is the output of the explain extended:
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| 1 | SIMPLE | product_categories | index | NULL | cat1 | 23 | NULL | 1224419 | 100.00 | Using where; Using index |
| 1 | SIMPLE | products | ALL | Product_id | NULL | NULL | NULL | 512376 | 100.00 | |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
Optimize Table
To establish a baseline, I would first recommend running an OPTIMIZE TABLE command on both tables. Please note that this might take some time. From the docs:
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
[...]
For MyISAM tables, OPTIMIZE TABLE works as follows:
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
Indexing
If space and index management isn't a concern, you can try adding a composite index on
product_categories.cat4, product_categories.cat3, product_categories.cat2, product_categories.cat1
This would be advised if you use a leftmost subset of these columns often in your queries. The query plan indicates that it can use the cat1 index of product_categories. This most likely only includes the cat1 column. By adding all four category columns to an index, it can more efficiently seek to the desired row. From the docs:
MySQL can use multiple-column indexes for queries that test all the
columns in the index, or queries that test just the first column, the
first two columns, the first three columns, and so on. If you specify
the columns in the right order in the index definition, a single
composite index can speed up several kinds of queries on the same
table.
Structure
Furthermore, given that your table has 90 columns you should also be aware that a wider table can lead to slower query performance. You may want to consider Vertically Partitioning your table into multiple tables:
Having too many columns can bloat your record size, which in turn
results in more memory blocks being read in and out of memory causing
higher I/O. This can hurt performance. One way to combat this is to
split your tables into smaller more independent tables with smaller
cardinalities than the original. This should now allow for a better
Blocking Factor (as defined above) which means less I/O and faster
performance. This process of breaking apart the table like this is a
called a Vertical Partition.
The meaning of your query seems to be "find all products that have the category '123456'." Is that correct?
COALESCE is an extraordinarily expensive function to use in a WHERE statement, because it operates on index-hostile NULL values. Your explain result shows that your query is not being very selective on your product_categories table. In MySQL you need to avoid functions in WHERE statements altogether if you want to exploit indexes to make your queries fast.
The thing someone else said about 90-column tables being harmful is also true. But you're stuck with it, so let's just deal with it.
Can we rework your query to get rid of the function-based WHERE? Let's try this.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT DISTINCT product_id
FROM product_categories
WHERE product_id <> 0
AND ( cat1='123456'
OR cat2='123456'
OR cat3='123456'
OR cat4='123456')
)
For this to work fast you're going to need to create separate indexes on your four cat columns. The composite unique index ("on all columns combined") is not going to help you. It still may not be so good.
A better solution might be FULLTEXT searching IN BOOLEAN MODE. You're working with the MyISAM access method so this is possible. It's definitely worth a try. It could be very fast indeed.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT product_id
FROM product_categories
WHERE MATCH(cat1,cat2,cat3,cat4)
AGAINST('123456' IN BOOLEAN MODE)
AND product_id <> 0
)
For this to work fast you're going to need to create a FULLTEXT index like so.
CREATE FULLTEXT INDEX cat_lookup
ON product_categories (cat1, cat2, cat3, cat4)
Note that neither of these suggested queries produce precisely the same results as your COALESCE query. The way your COALESCE query is set up, some combinations won't match it that will match these queries. For example.
cat1 cat2 cat3 cat4
123451 123453 123455 123456 matches your and my queries
123456 123455 123454 123452 matches my queries but not yours
But it's likely that my queries will produce a useful list of products, even if it has a few more items in yours.
You can debug this stuff by just working with the inner queries on product_categories.
There is something strange. Does the table product_categories indeed have a product_id column? Shouldn't the from and where clauses be like this:
FROM `product_categories` pc
LEFT OUTER JOIN `products` p ON p.category_id = pc.id
WHERE
COALESCE(product_categories.cat4, product_categories.cat3,product_categories.cat2, product_categories.cat1) = '123456'
AND pc.id != 0

Performance Penalties for Unused Joins

I'm writing a script that generates a report based on a query that uses several tables joined together. One of the inputs to the script is going to be a list of the fields that are required on the report. Depending on the fields requested, some of the tables might not be needed. My question is: is there a [significant] performance penalty for including a join when if it is not referenced in a SELECT or WHERE clause?
Consider the following tables:
mysql> SELECT * FROM `Books`;
+----------------------+----------+
| title | authorId |
+----------------------+----------+
| Animal Farm | 3 |
| Brave New World | 2 |
| Fahrenheit 451 | 1 |
| Nineteen Eighty-Four | 3 |
+----------------------+----------+
mysql> SELECT * FROM `Authors`;
+----+----------+-----------+
| id | lastName | firstName |
+----+----------+-----------+
| 1 | Bradbury | Ray |
| 2 | Huxley | Aldous |
| 3 | Orwell | George |
+----+----------+-----------+
Does
SELECT
`Authors`.`lastName`
FROM
`Authors`
WHERE
`Authors`.`id` = 1
Outperform:
SELECT
`Authors`.`lastName`
FROM
`Authors`
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
WHERE
`Authors`.`id` = 1
?
It seems to me that MySQL should just know to ignore the JOIN completely, since the table is not referenced in the SELECT or WHERE clause. But somehow I doubt this is the case. Of course, this is a really basic example. The actual data involved will be much more complex.
And really, it's not a terribly huge deal... I just need to know if my script needs to be "smart" about the joins, and only include them if the fields requested will rely on them.
This isn't actually unused since it means that only Authors that exist in Books are included in the result set.
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
However if you "knew" that every Author existed in Book than there would be some performance benefit in removing the join but it would largely depend on idexes and the number of records in the table and the logic in the join (especially when doing data conversions)
This is the kind of question that is impossible to answer. Yes, adding the join will take additional time; it's impossible to tell whether you'd be able to measure that time without, well, uh....measuring the time.
Broadly speaking, if - like in your example - you're joining on primary keys, with unique indices, it's unlikely to make a measurable difference.
If you've got more complex joins (which you hint at), or are joining on fields without an index, or if your join involves a function, the performance penalty may be significant.
Of course, it may still be easier to do it this way that write multiple queries which are essentially the same, other than removing unneeded joins.
Final bit of advice - try abstracting the queries into views. That way, you can optimize performance once, and perhaps write your report queries in a more simple way...
Joins will always take time.
Side effects
On top of that inner join (which is the default join) influences the result by limiting the number of rows you get.
So depending on whether all authors are in books the two queries may or may not be identical.
Also if an author has written more than one book the resultset of the 'joined' query will show duplicate results.
Performance
In the WHERE clause you have stated authors.id to be a constant =1, therefore (provided you have indexes on author.id and books.author_id) it will be a very fast lookup for both tables. The query-time between the two tables will be very close.
In general joins can take quite a lot of time though and with all the added side effects should only be undertaken if you really want to use the extra info the join offers.
It seems that there are two things that you are trying to determine: If there are any optimizations that can be done between the two select statements, and which of the two would be the fastest to execute.
It seems that since the join really is limiting the returned results by authors who have books in the list, that there can not be that much optimization done.
It also seems that for the case that you were describing where the joined table really has no limiting effect on the returned results, that the query where there was no joining of the tables would perform faster.

Optimizing a simple query on two large tables

I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most viewed page id's by a person's friend.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds too execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightening fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is Mysql doing?
As far as I know, the best way to make a query like that "lightning fast", is to create a summary table that tracks friend page views per page per creator.
You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.
If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
You should absolutely look into denormalizing this table. If you create a separate table that maintains the user id's and the exact counts for every page they viewed your query should become a lot simpler.
You can easily maintain this table by using a trigger on your views table, that does updates to the 'views_summary' table whenever an insert happens on the 'views' table.
You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top x pages per person
Hope this helps,
Evert
Your indexes look correct although if friendship has very big rows, you might want the index on (creator_id, receiver_id) to avoid reading all of it.
However something's not right here, why are you doing a filesort for 271 rows?
Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.
sort_buffer should also have a sane value.