I'm working on a project and I have a problem with optimization in MySQL.
My main table looks like this and has around 1M rows:
+----+------+---------+
| id | Name | city_id |
+----+------+---------+
city_id is between 0 and 2000.
I'll make many queries like:
SELECT * FROM table WHERE city_id=x
SELECT * FROM table WHERE city_id=x AND id=rand()
These queries are only meant to show the main operations on this database.
If I make 2k small tables, will that be a good solution?
I think the solution you are looking for is an index. Try this:
create index idx_table_city_id on table(city_id, id);
SQL is designed to handle large tables. There are very few reasons why you would want to split data from one table into multiple tables. The only good reason I can think of is when doing so is needed to meet security requirements.
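If it helps, here is one way to confirm the optimizer actually picks that index up (a sketch; note that table is a reserved word in MySQL, so if the table really is named that it has to be backquoted, and the values 42/123456 are just placeholders):

CREATE INDEX idx_table_city_id ON `table` (city_id, id);

-- EXPLAIN should now list idx_table_city_id in the "key" column for both query shapes
EXPLAIN SELECT * FROM `table` WHERE city_id = 42;
EXPLAIN SELECT * FROM `table` WHERE city_id = 42 AND id = 123456;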
Related
I have an old server and a huge MySQL database of students' marks.
There is only one common table for all students:
student_id | teacher_id | mark | comment
There are six schools in this project and ~800 students; every day we get ~5000 new marks.
Students have a problem with performance: every query for their marks takes about two minutes to return results.
By the way, I do use table indexing.
My question is: if I use normalization and make a separate table for every student, like this:
STUDENTS_TABLE
student_id | table
ivanov | ivanov_table
IVANOV_TABLE
teacher_id | mark | comment
will it help me get better performance?
I have no opportunity to buy a new server.
ADD:
When I run mysql> SELECT * FROM all_students_table WHERE student_id=001 it takes too long. I think it is because the information for all students is in one huge table. And I suppose that if an individual table is created for every student, it will take less time for a query like this: mysql> SELECT * FROM student_001_table. Am I right?
ADD:
This table is three years old and
mysql> SELECT COUNT(*) FROM students_marks
returns 2,453,389 rows, and it grows every day.
Since a simple query like SELECT * FROM all_students_table WHERE student_id=001 takes too long, the only sensible conclusion is that the table does not have a proper index. A query like this needs an index on student_id. When that index is present, the query should perform almost as well for 2.5 million rows as it does for 1,000 rows (assuming each student_id appears with similar frequency in the table).
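A minimal sketch of that fix, assuming the table is named students_marks as in the COUNT(*) query above:

-- Add the missing index so lookups by student stop scanning all 2.5M rows
CREATE INDEX idx_students_marks_student_id ON students_marks (student_id);

-- EXPLAIN should now show this index being used instead of a full table scan
EXPLAIN SELECT * FROM students_marks WHERE student_id = 001;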
First, WHAT @Arjan said about indexes.
Second, from experience, you will need at least 3 and probably 5 tables, with very precise INDEXes and PRIMARY KEYs (see the sketch after this list):
Schools
Teachers
Students
Student_In_School
Student_Grade (Links Teacher/Student/Grade/Comment) <- Your current table
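As a rough illustration of two of those tables (the column types and index names here are assumptions, not a prescription):

CREATE TABLE Students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

CREATE TABLE Student_Grade (
    student_id INT NOT NULL,
    teacher_id INT NOT NULL,
    mark       INT NOT NULL,
    comment    TEXT,
    INDEX idx_grade_student (student_id),
    INDEX idx_grade_teacher (teacher_id)
);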
Third, and this is counter-intuitive, performance will DECREASE, because you have to search multiple tables and link them. Normalization is NOT for performance, but rather for sanity checks, validation, and elimination of repeated information. For example, you can NOW have the actual names of the teachers/students as opposed to obscure IDs.
The good news is that you have very little data (despite what you think), and an old machine can handle it with proper INDEXes.
Hope this helps
While creating a notification system I ran across a question. The community the system is created for is rather big, and I have 2 ideas for my SQL tables:
Make one table, which includes:
comments table:
id(AUTO_INCREMENT) | comment(text) | viewers_id(int) | date(datetime)
In this option, the comments are stored with a date and the IDs of all users that viewed the comment, separated with ",". For example:
1| Hi I'm a penguin|1,2,3,4|24.06.1879
The system should now use the column viewers_id to decide if it should show a notification or not.
Make two tables, like:
comments table:
id(AUTO_INCREMENT) | comment(text) | date(datetime)
viewer table:
id(AUTO_INCREMENT) | comment_id | viewers_id(int)
example:
5|I'm a rock|23.08.1778
1|5|1,2,3,4
In this example we check the viewers_id again.
Which of these is likely to have better performance?
In my opinion you shouldn't focus that much on optimizing your tables, since it's far more rewarding to optimize your application first.
Now to your question:
Increasing the performance of an SQL table can be achieved in 2 ways:
1. Normalize: for every SQL table I would recommend you normalize it: Wikipedia - normalization
2. Reduce concurrency: that means reducing the number of times data can't be accessed because it is being changed.
As for your example: if I had to pick one of those, I would pick the second option.
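To make that concrete, here is a minimal sketch of the second option in fully normalized form, with one row per comment/viewer pair instead of a comma-separated list (table and column names are illustrative):

CREATE TABLE comments (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    comment TEXT NOT NULL,
    created DATETIME NOT NULL
);

CREATE TABLE comment_views (
    comment_id INT NOT NULL,
    viewer_id  INT NOT NULL,
    PRIMARY KEY (comment_id, viewer_id)
);

-- "Has user 4 already seen comment 5?" becomes a cheap indexed lookup
SELECT 1 FROM comment_views WHERE comment_id = 5 AND viewer_id = 4;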
Let's say I have a polymorphic table similar to this:
| document_id | owner_type | owner_id |
| 1 | Client | 1 |
| 1 | Client | 2 |
| 2 | User | 1 |
I know I'll be calling queries looking for owner_type and for owner_type + owner_id:
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client'
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client' AND owner_id = 1
Let's ignore how to index document_id; I would like to know the best way (performance-wise) to index the owner columns for these SQL scenarios.
Solution 1:
CREATE INDEX do_type_id_ix ON document_ownerships (owner_type, owner_id)
this way I would have just one index that works for both scenarios
Solution 2:
CREATE INDEX do_id_type_ix ON document_ownerships (owner_id, owner_type)
CREATE INDEX do_type_ix ON document_ownerships (owner_type)
This way I would have indexes that exactly match the way I will use the database. The only downside is that I would have 2 indexes where I could have just one.
Solution 3:
CREATE INDEX do_id_ix ON document_ownerships (owner_id)
CREATE INDEX do_type_ix ON document_ownerships (owner_type)
individual column indexes
From what I have explored in the MySQL console with EXPLAIN, I get really similar results, and because it's a new project I don't have enough data to explore this properly and be 100% sure (even after I populated the database with several hundred records). So can anyone give me a piece of advice from their experience?
This is going to depend a lot on the distribution of your data - indexes only make sense if there is good selectivity in the indexed columns.
e.g. if there are only 2 possible values for owner_type, viz Client and User, and assuming they are distributed evenly, then any index only on owner_type will be pointless. In this case, a query like
SELECT * FROM document_name_ownerships WHERE owner_type = 'Client';
would likely return a large percentage of the records in the table, and a scan is the best that is possible (Although I'm assuming your real queries will join to the derived tables and filter on derived table-specific columns, which would be a very different query plan to this one.)
Thus I would consider indexing
Only on owner_id, assuming this gives a good degree of selectivity by itself,
Or, on the combination (owner_id, owner_type), only if there is evidence that index #1 isn't selective, AND if the combination of the 2 fields gives sufficient selectivity to warrant the index.
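If it helps, here is one way to check that selectivity before committing to an index (a sketch, using the document_ownerships name from the index statements above):

-- The closer these distinct counts get to total_rows, the more selective the column(s)
SELECT
    COUNT(*)                             AS total_rows,
    COUNT(DISTINCT owner_type)           AS distinct_types,
    COUNT(DISTINCT owner_id)             AS distinct_ids,
    COUNT(DISTINCT owner_type, owner_id) AS distinct_pairs
FROM document_ownerships;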
I have three tables. My news items have one or several categories.
News
-------------------------
| id | title | created
Category
-------------------------
| id | title
News_Category
-------------------------
| news_id | category_id
But I have many rows in News, about 10,000,000 rows. Using joins to fetch data will be a performance issue.
Select title from News_Category left join News on (News_Category.news_id = News.id)
group by News_Category.id order by News.created desc limit 10
I want the best query for this situation. For a many-to-many relation over huge tables, which query has better performance?
Please give me the best query for this use case.
The best performance for that query comes from permanently storing its result. That is, you need a materialized view.
In MySQL you can implement a materialized view by creating a table,
like this:
create table FooMaterializedView as
(select foo1.*, foo2.* from foo1 join foo2 on ( ... ) where ... order by ...);
Now, depending on how often the source tables change (that is, receive inserts, updates or deletes) and how badly you need the latest version of the query result, you need to implement a suitable view maintenance strategy.
That is, depending on your needs and the problem itself, perform either:
full recomputation (i.e. truncate the materialized view and regenerate it from scratch), which might be enough, or
incremental computation. If it is too costly for the system to perform a full recomputation very often, you must capture only the changes on the source tables and update the materialized view according to those changes.
If you need to take the incremental approach, I can only wish you the best of luck. I can point out that you can use triggers to capture the changes on the source tables, and you will need either an algorithmic or an equalization approach to compute the changes to apply to the materialized view.
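As an illustration only, here is what a full-recomputation refresh could look like for the News schema from the question (the materialized table name latest_news_mv is made up):

-- One-time creation of the materialized view as a plain table
CREATE TABLE latest_news_mv AS
SELECT nc.category_id, n.id AS news_id, n.title, n.created
FROM News_Category nc
JOIN News n ON n.id = nc.news_id;

-- Periodic full refresh: discard the old contents and rebuild from scratch
TRUNCATE TABLE latest_news_mv;
INSERT INTO latest_news_mv (category_id, news_id, title, created)
SELECT nc.category_id, n.id, n.title, n.created
FROM News_Category nc
JOIN News n ON n.id = nc.news_id;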
I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most-viewed page IDs among a person's friends.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightning fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the DB, but do it in Ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is MySQL doing?
As far as I know, the best way to make a query like that "lightning fast" is to create a summary table that tracks friend page views per page per creator.
You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.
If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
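A minimal sketch of that approach, assuming the views and friendships tables from the question; the summary table and trigger names are hypothetical:

-- Hypothetical summary table: one row per (creator, page) with a running view count
CREATE TABLE friend_page_views_summary (
    creator_id INT NOT NULL,
    page_id    INT NOT NULL,
    view_count INT NOT NULL DEFAULT 0,
    PRIMARY KEY (creator_id, page_id),
    INDEX idx_summary_creator_count (creator_id, view_count)
);

-- Hypothetical trigger: when a user views a page, bump the counter for everyone
-- who counts that user among their friends
DELIMITER //
CREATE TRIGGER trg_views_after_insert AFTER INSERT ON views
FOR EACH ROW
BEGIN
    INSERT INTO friend_page_views_summary (creator_id, page_id, view_count)
    SELECT f.creator_id, NEW.page_id, 1
    FROM friendships f
    WHERE f.receiver_id = NEW.user_id
    ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END//
DELIMITER ;

-- The "top 20 pages my friends viewed" query then becomes a simple indexed read
SELECT page_id
FROM friend_page_views_summary
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;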
You should absolutely look into denormalizing this table. If you create a separate table that maintains the user IDs and the exact counts for every page they viewed, your query should become a lot simpler.
You can easily maintain this table by using a trigger on your views table that updates the 'views_summary' table whenever an insert happens on the 'views' table.
You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top x pages per person.
Hope this helps,
Evert
Your indexes look correct, although if friendships has very big rows, you might want the index on (creator_id, receiver_id) to avoid reading all of them.
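If you do add it, a sketch of that covering index (the index name is made up):

CREATE INDEX idx_friendships_creator_receiver ON friendships (creator_id, receiver_id);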
However, something's not right here: why are you doing a filesort for 271 rows?
Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.
sort_buffer_size should also have a sane value.
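For reference, one way to inspect and bump those settings for a test (a sketch; the values are illustrative, changing globals requires sufficient privileges, and SET GLOBAL only affects connections opened after the change):

-- Check the current values
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Example: raise them to see whether the GROUP BY / filesort gets faster
SET GLOBAL tmp_table_size       = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size  = 64 * 1024 * 1024;
SET SESSION sort_buffer_size    = 4 * 1024 * 1024;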