SQL table performance - more or fewer tables? - mysql

While building a notification system I ran into a question. The community the system is being built for is rather big, and I have 2 ideas for my SQL tables:
1. Make one table which includes everything:
comments table:
id(AUTO_INCREMENT) | comment(text) | viewers_id(int) | date(datetime)
In this option, each comment is stored with a date and the IDs of all users who have viewed it, separated by ",". For example:
1| Hi I'm a penguin|1,2,3,4|24.06.1879
The system should then use the viewers_id column to decide whether to show a notification or not.
2. Make two tables, like:
comments table:
id(AUTO_INCREMENT) | comment(text) | date(datetime)
viewer table:
id(AUTO_INCREMENT) | comment_id | viewers_id(int)
example:
5|I'm a rock|23.08.1778
1|5|1,2,3,4
In this example we check the viewers_id again.
Which of these is likely to have better performance?

In my opinion you shouldn't focus that much on optimizing your tables, since it's far more rewarding to optimize your application first.
Now to your question:
Increasing the performance of an SQL table can be achieved in 2 ways:
1. Normalize. I would recommend normalizing every SQL table (see Wikipedia - normalization).
2. Reduce contention, that is, reduce the amount of time during which data can't be accessed because it is being changed.
As for your example: if I had to pick one of those, I would pick the second option.
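Note that the second option as posted still keeps a comma-separated list in viewers_id. A minimal sketch of what a normalized version could look like, with one row per (comment, viewer) pair, is below. It uses Python's built-in sqlite3 as a stand-in for MySQL so it can be run as-is; the table name comment_viewers and the helper unseen_comments are made up for illustration and are not from the question.

```python
import sqlite3

# SQLite stands in for MySQL here; the schema idea carries over.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        comment TEXT NOT NULL,
        date    TEXT NOT NULL
    );
    -- One row per (comment, viewer) pair instead of a "1,2,3,4" string.
    CREATE TABLE comment_viewers (
        comment_id INTEGER NOT NULL REFERENCES comments(id),
        viewer_id  INTEGER NOT NULL,
        PRIMARY KEY (comment_id, viewer_id)
    );
""")

conn.execute(
    "INSERT INTO comments (id, comment, date) VALUES (?, ?, ?)",
    (5, "I'm a rock", "1778-08-23"),
)
conn.executemany(
    "INSERT INTO comment_viewers (comment_id, viewer_id) VALUES (?, ?)",
    [(5, 1), (5, 2), (5, 3), (5, 4)],
)


def unseen_comments(conn, viewer_id):
    """Return the ids of comments the given viewer has not seen yet."""
    rows = conn.execute(
        """
        SELECT c.id
        FROM comments AS c
        LEFT JOIN comment_viewers AS v
               ON v.comment_id = c.id AND v.viewer_id = ?
        WHERE v.comment_id IS NULL
        ORDER BY c.id
        """,
        (viewer_id,),
    ).fetchall()
    return [row[0] for row in rows]


for_viewer_2 = unseen_comments(conn, 2)  # viewer 2 already saw comment 5
for_viewer_9 = unseen_comments(conn, 9)  # viewer 9 has not seen it
```

With this layout the "should I notify?" check is an indexed lookup on the primary key rather than a string search inside a list.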

Related

Will normalization of a school database give better performance?

I have an old server and a huge MySQL database of students' marks.
There is only one common table for all students:
student_id | teacher_id | mark | comment
There are six schools in this project and ~800 students; every day we add ~5000 marks.
Students have a problem with performance: every query of their marks takes about two minutes to return results.
By the way, I use table indexing.
I have a question: if I use normalization and make a separate table for every student, like this:
STUDENTS_TABLE
student_id | table
ivanov | ivanov_table
IVANOV_TABLE
teacher_id | mark | comment
will it give me better performance?
I have no opportunity to buy a new server.
ADD:
When I use mysql> SELECT * FROM all_students_table where student_id=001 it takes too long. I think it is because the information for all students is in one huge table. I suppose that if an individual table is created for every student, a query like this will take less time: mysql> SELECT * FROM student_001_table. Am I right?
ADD:
This table is three years old and
mysql> SELECT COUNT(*) FROM students_marks
gives a result of 2 453 389 rows, and it grows every day.
Since a simple query like SELECT * FROM all_students_table where student_id=001 takes too long, the only sensible conclusion is that the table does not have proper indexes. A query like this needs an index on student_id. When that index is present, the query should perform almost as well for 2.5 million rows as it does for 1,000 rows (assuming each student_id appears similarly frequently in the table).
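To see the effect of the index on the query plan, here is a small sketch. It uses Python's sqlite3 instead of MySQL (so the plan text differs from MySQL's EXPLAIN), synthetic data, and a made-up index name idx_marks_student; the helper plan is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students_marks "
    "(student_id INTEGER, teacher_id INTEGER, mark INTEGER, comment TEXT)"
)
# Synthetic data: 800 students with 50 marks each.
conn.executemany(
    "INSERT INTO students_marks VALUES (?, ?, ?, '')",
    [(s, s % 40, (s + i) % 5 + 1) for s in range(800) for i in range(50)],
)


def plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM students_marks WHERE student_id = 1"

plan_before = plan(conn, query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_marks_student ON students_marks (student_id)")
plan_after = plan(conn, query)   # now the lookup goes through the index

print(plan_before)
print(plan_after)
```

In MySQL the equivalent check would be running EXPLAIN on the query before and after CREATE INDEX and looking at the key and rows columns.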
First, what @Arjan said about indexes.
Second, from experience, you will need at least 3 tables, and probably 5, with very precise INDEXes and PRIMARY KEYs:
Schools
Teachers
Students
Student_In_School
Student_Grade (Links Teacher/Student/Grade/Comment) <- Your current table
Third, and this is counterintuitive, performance will DECREASE, because you have to search multiple tables and link them. Normalization is NOT for performance, but rather for sanity checks, validation, and elimination of repeated information. For example, you can NOW store the actual names of the teachers/students as opposed to obscure IDs.
The good news is you have very little data (despite what you think) and an old machine can handle it with proper INDEXes.
Hope this helps

Optimization for search

I'm working on a project and I have a problem with optimization in MySQL.
My main table has around 1M rows and looks like this (city_id is between 0 and 2000):
+----+------+---------+
| id | Name | city_id |
+----+------+---------+
I'll make many queries like:
SELECT * FROM table WHERE city_id=x
SELECT * FROM table WHERE city_id=x AND id=rand()
This is only to show you the main operations on this database.
If I make 2k small tables, will that be a good solution?
I think the solution you are looking for is an index. Try this:
create index idx_table_city_id on table(city_id, id);
SQL is designed to handle large tables. There are very few reasons why you would want to split up data from one table into multiple tables. The only good reason I can think of is when doing so is needed to meet security requirements.

many to many relation performance

I have three tables. Each news item has one or several categories.
News
-------------------------
| id | title | created
Category
-------------------------
| id | title
News_Category
-------------------------
| news_id | category_id
But I have many rows in News, about 10,000,000. Using joins to fetch the data will be a performance issue.
Select title from News_Category left join News on (News_Category.news_id = News.id)
group by News_Category.id order by News.created desc limit 10
I want the best query for this case: for a many-to-many relation over huge tables, which query has better performance?
Please give me the best query for this use case.
The best performance for that query is obtained by permanently storing its result. That is, you need a materialized view.
In MySQL you can implement a materialized view by creating a table.
That is:
create table FooMaterializedView as
(select foo1.*, foo2.* from foo1 join foo2 on ( ... ) where ... order by ...);
Now, depending on how often the source tables change (that is, receive inserts, updates or deletes) and how much you need the latest version of the query result, you need to implement a suitable view maintenance strategy.
That is, depending on your needs and the problem itself, perform one of:
full computation (i.e. truncate the materialized view and generate it again from scratch), which might be enough
incremental computation. If it is too costly for the system to perform a full computation very often, you must capture only the changes on the source tables and update the materialized view according to those changes.
If you need to take the incremental approach, I can only wish you the best of luck. I can point out that you can use triggers to capture the changes on the source tables, and you will need to use either an algorithmic or an equalization approach to compute the changes to make to the materialized view.
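A rough sketch of the full-computation strategy for the News/Category tables from the question follows. It runs on Python's sqlite3 rather than MySQL, and the table name LatestNewsByCategory, its index, the sample rows, and the helpers refresh_latest_news and latest_titles are all assumptions made for the example, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE News (id INTEGER PRIMARY KEY, title TEXT, created TEXT);
    CREATE TABLE Category (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE News_Category (news_id INTEGER, category_id INTEGER,
                                PRIMARY KEY (news_id, category_id));
    -- Plain table acting as the materialized view (hypothetical name).
    CREATE TABLE LatestNewsByCategory (category_id INTEGER, news_id INTEGER,
                                       title TEXT, created TEXT);
    CREATE INDEX idx_latest_cat_created
        ON LatestNewsByCategory (category_id, created);
""")


def refresh_latest_news(conn):
    """Full recomputation: empty the table and rebuild it from the sources."""
    with conn:  # both statements run in one transaction
        conn.execute("DELETE FROM LatestNewsByCategory")
        conn.execute("""
            INSERT INTO LatestNewsByCategory (category_id, news_id, title, created)
            SELECT nc.category_id, n.id, n.title, n.created
            FROM News_Category AS nc
            JOIN News AS n ON n.id = nc.news_id
        """)


def latest_titles(conn, category_id, limit=10):
    """Read the newest titles for one category from the precomputed table."""
    rows = conn.execute(
        "SELECT title FROM LatestNewsByCategory "
        "WHERE category_id = ? ORDER BY created DESC LIMIT ?",
        (category_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Sample data, made up for the example.
conn.executemany("INSERT INTO News VALUES (?, ?, ?)",
                 [(1, "A", "2024-01-01"), (2, "B", "2024-01-02"), (3, "C", "2024-01-03")])
conn.executemany("INSERT INTO Category VALUES (?, ?)", [(1, "sports"), (2, "politics")])
conn.executemany("INSERT INTO News_Category VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 1), (3, 2)])

refresh_latest_news(conn)
newest_two = latest_titles(conn, 1, 2)
```

Reads never touch the join; the cost is that the table is stale between refreshes, so how often refresh_latest_news runs depends on how fresh the results need to be.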

Performance Penalties for Unused Joins

I'm writing a script that generates a report based on a query that uses several tables joined together. One of the inputs to the script is going to be a list of the fields that are required on the report. Depending on the fields requested, some of the tables might not be needed. My question is: is there a [significant] performance penalty for including a join when it is not referenced in a SELECT or WHERE clause?
Consider the following tables:
mysql> SELECT * FROM `Books`;
+----------------------+----------+
| title | authorId |
+----------------------+----------+
| Animal Farm | 3 |
| Brave New World | 2 |
| Fahrenheit 451 | 1 |
| Nineteen Eighty-Four | 3 |
+----------------------+----------+
mysql> SELECT * FROM `Authors`;
+----+----------+-----------+
| id | lastName | firstName |
+----+----------+-----------+
| 1 | Bradbury | Ray |
| 2 | Huxley | Aldous |
| 3 | Orwell | George |
+----+----------+-----------+
Does
SELECT
`Authors`.`lastName`
FROM
`Authors`
WHERE
`Authors`.`id` = 1
Outperform:
SELECT
`Authors`.`lastName`
FROM
`Authors`
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
WHERE
`Authors`.`id` = 1
?
It seems to me that MySQL should just know to ignore the JOIN completely, since the table is not referenced in the SELECT or WHERE clause. But somehow I doubt this is the case. Of course, this is a really basic example. The actual data involved will be much more complex.
And really, it's not a terribly huge deal... I just need to know if my script needs to be "smart" about the joins, and only include them if the fields requested will rely on them.
This isn't actually unused since it means that only Authors that exist in Books are included in the result set.
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
However, if you "knew" that every Author existed in Books, then there would be some performance benefit in removing the join, but it would largely depend on indexes, the number of records in the table, and the logic in the join (especially when doing data conversions).
This is the kind of question that is impossible to answer. Yes, adding the join will take additional time; it's impossible to tell whether you'd be able to measure that time without, well, uh....measuring the time.
Broadly speaking, if - like in your example - you're joining on primary keys, with unique indices, it's unlikely to make a measurable difference.
If you've got more complex joins (which you hint at), or are joining on fields without an index, or if your join involves a function, the performance penalty may be significant.
Of course, it may still be easier to do it this way than to write multiple queries which are essentially the same, other than removing unneeded joins.
Final bit of advice - try abstracting the queries into views. That way, you can optimize performance once, and perhaps write your report queries in a simpler way...
Joins will always take time.
Side effects
On top of that, an inner join (which is the default join) influences the result by limiting the number of rows you get.
So depending on whether all authors are in books the two queries may or may not be identical.
Also if an author has written more than one book the resultset of the 'joined' query will show duplicate results.
Performance
In the WHERE clause you have stated authors.id to be a constant =1, therefore (provided you have indexes on author.id and books.author_id) it will be a very fast lookup for both tables. The query time of the two queries will be very close.
In general joins can take quite a lot of time though and with all the added side effects should only be undertaken if you really want to use the extra info the join offers.
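The side effects described above can be checked with the Books/Authors data from the question. This sketch uses Python's sqlite3 in place of MySQL; author 4 ("Doe") is a made-up row added only to show an author without books, and last_names is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (id INTEGER PRIMARY KEY, lastName TEXT, firstName TEXT);
    CREATE TABLE Books (title TEXT, authorId INTEGER);
    INSERT INTO Authors VALUES (1, 'Bradbury', 'Ray');
    INSERT INTO Authors VALUES (2, 'Huxley', 'Aldous');
    INSERT INTO Authors VALUES (3, 'Orwell', 'George');
    INSERT INTO Authors VALUES (4, 'Doe', 'Jane');  -- hypothetical, has no books
    INSERT INTO Books VALUES ('Animal Farm', 3);
    INSERT INTO Books VALUES ('Brave New World', 2);
    INSERT INTO Books VALUES ('Fahrenheit 451', 1);
    INSERT INTO Books VALUES ('Nineteen Eighty-Four', 3);
""")


def last_names(conn, author_id, join_books):
    """Run the query from the question, with or without the JOIN on Books."""
    sql = "SELECT Authors.lastName FROM Authors"
    if join_books:
        sql += " JOIN Books ON Authors.id = Books.authorId"
    sql += " WHERE Authors.id = ?"
    return [row[0] for row in conn.execute(sql, (author_id,))]


for author_id in (1, 3, 4):
    print(author_id,
          last_names(conn, author_id, join_books=False),
          last_names(conn, author_id, join_books=True))
```

Author 1 comes back the same either way, author 3 is duplicated by the join (two books), and author 4 disappears from the joined result, so a report script that drops "unused" joins can change the row set, not just the speed.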
It seems that there are two things that you are trying to determine: If there are any optimizations that can be done between the two select statements, and which of the two would be the fastest to execute.
It seems that since the join really is limiting the returned results by authors who have books in the list, that there can not be that much optimization done.
It also seems that for the case that you were describing where the joined table really has no limiting effect on the returned results, that the query where there was no joining of the tables would perform faster.

Optimizing a simple query on two large tables

I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 page ids most viewed by a person's friends.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightning fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is Mysql doing?
As far as I know, the best way to make a query like that "lightning fast", is to create a summary table that tracks friend page views per page per creator.
You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.
If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
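Here is a sketch of the trigger-maintained variant of such a summary table. It runs on Python's sqlite3, so the trigger syntax is SQLite's rather than MySQL's, and the names friend_page_views, idx_fpv_top, views_after_insert and top_pages are invented for the example. It only handles inserts into views; deleted views and added or removed friendships would need their own handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendships (receiver_id INTEGER, creator_id INTEGER,
                              PRIMARY KEY (receiver_id, creator_id));
    CREATE TABLE views (user_id INTEGER, page_id INTEGER,
                        PRIMARY KEY (user_id, page_id));
    -- Hypothetical summary table: views by a creator's friends, per page.
    CREATE TABLE friend_page_views (creator_id INTEGER, page_id INTEGER,
                                    view_count INTEGER NOT NULL DEFAULT 0,
                                    PRIMARY KEY (creator_id, page_id));
    CREATE INDEX idx_fpv_top ON friend_page_views (creator_id, view_count);

    CREATE TRIGGER views_after_insert AFTER INSERT ON views
    BEGIN
        -- Ensure a counter row exists for each creator who has the viewer as a friend.
        INSERT OR IGNORE INTO friend_page_views (creator_id, page_id, view_count)
        SELECT creator_id, NEW.page_id, 0
        FROM friendships
        WHERE receiver_id = NEW.user_id;
        -- Then increment those counters.
        UPDATE friend_page_views
        SET view_count = view_count + 1
        WHERE page_id = NEW.page_id
          AND creator_id IN (SELECT creator_id FROM friendships
                             WHERE receiver_id = NEW.user_id);
    END;
""")


def top_pages(conn, creator_id, limit=20):
    """Most viewed pages among a creator's friends, read from the summary table."""
    rows = conn.execute(
        "SELECT page_id FROM friend_page_views WHERE creator_id = ? "
        "ORDER BY view_count DESC, page_id LIMIT ?",
        (creator_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sample data: creator 143416 has friends 1, 2 and 3.
conn.executemany("INSERT INTO friendships VALUES (?, ?)",
                 [(1, 143416), (2, 143416), (3, 143416)])
conn.executemany("INSERT INTO views VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 30)])

top_two = top_pages(conn, 143416, 2)
```

The read side becomes a single indexed lookup with no GROUP BY; the work has moved to every insert into views, which is why the batching idea above may be preferable when writes are frequent.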
You should absolutely look into denormalizing this table. If you create a separate table that maintains the user id's and the exact counts for every page they viewed your query should become a lot simpler.
You can easily maintain this table by using a trigger on your views table, that does updates to the 'views_summary' table whenever an insert happens on the 'views' table.
You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top x pages per person
Hope this helps,
Evert
Your indexes look correct, although if the friendships table has very big rows, you might want the index on (creator_id, receiver_id) to avoid reading all of it.
However something's not right here, why are you doing a filesort for 271 rows?
Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.
sort_buffer_size should also have a sane value.