I am thinking about a database schema for posts and their comments, in the context of a social networking application, and I'm wondering which of these two approaches would give better performance:
I am storing comments of a post in a "Comments" table and posts in a "Posts" table.
Now my schema for the comments table looks like this:
postId commentId postedBy Date CommentBody
To retrieve the comments of a post, I would have to search for all rows whose postId matches that post's id, and postId cannot be the primary key since it is not unique within the column (there are several comments per post). So I was thinking of merging postId and commentId into a single commentId (which becomes the primary key) from which the postId could also be recovered. This is how I am thinking:
CommentId would be generated as postId*100+i (where i is the ith comment on the post)
Thus, to retrieve the comments for a post (say with postId = 8452), I would search for all rows with a commentId (the primary key) between 845200 and 845299, instead of searching for all comments with postId = 8452. (Of course, this limits the maximum number of comments per post to 100.) But will this lead to any performance gains?
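In other words, the comparison is between these two lookups (a quick sketch against the Comments table above):

-- Option A: plain lookup on a postId column (backed by an index on postId)
SELECT * FROM Comments WHERE postId = 8452;

-- Option B: range scan over the encoded primary key
SELECT * FROM Comments WHERE commentId BETWEEN 845200 AND 845299;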
Here's what you do. Load up a database with representative data at (for example) twice the size you ever expect it to get.
Then run your queries and test them against both versions of the schema.
Then, and this is the good bit, retest this every X weeks with new up-to-date data to ensure the situation hasn't changed.
That's what being a DBA is all about. Unless your data will never change, database optimisation is not a set-and-forget operation. And the only way to be sure is to test under representative conditions.
Everything else is guesswork. Educated guesswork, don't get me wrong, but I'd rather have a deterministic answer than anyone's guess, especially since the former will adapt to changes.
My favorite optimisation mantra is "Measure, don't guess!"
I'd recommend:
Use a two-table structure with a composite key in comments for the best uniqueness in the index.
100 comments per article is a bad limitation that may come back to bite you.
Don't use different tables for comments on videos/pictures etc.
If you get huge amounts of comments, add a comment-archive table and move old comments there. The most requested comments (the newest) will then live in a smaller, more efficient table (see the sketch after this list).
Do save blobs (pictures and videos) on a different partition and not in the db. The db will be smaller and less fragmented at the file level.
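A rough sketch of the archive idea, using the Comments columns from the question (the Comments_Archive table and the one-year cutoff are assumptions for illustration):

-- Move comments older than a year into an archive table with the same columns
INSERT INTO Comments_Archive
SELECT * FROM Comments WHERE Date < NOW() - INTERVAL 1 YEAR;

DELETE FROM Comments WHERE Date < NOW() - INTERVAL 1 YEAR;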
If you are going to get big volumes, you should make a Post table and a Comments table so that each table stays smaller :). And don't forget to use indexes and partitions on them.
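For example, one possible partitioning sketch (hash partitioning by postId is just one option; names are taken from the question):

CREATE TABLE Comments (
    postId INT NOT NULL,
    commentId INT NOT NULL,
    PRIMARY KEY (postId, commentId)
)
PARTITION BY HASH (postId)
PARTITIONS 8;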
Use a composite key. Or, if you're using some framework that only allows single-column keys, a secondary index on postId.
If commentId is not unique, you can create a composite PRIMARY KEY on (postId, commentId):
CREATE TABLE Comment
(
    postId INT NOT NULL,
    commentId INT NOT NULL,
    …,
    PRIMARY KEY (postId, commentId)
)
If your table is MyISAM, you can mark commentId as AUTO_INCREMENT, which will assign it a unique incrementing value per post.
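A minimal sketch of that MyISAM behavior (when the AUTO_INCREMENT column is the second column of the primary key, MyISAM numbers it within each postId group):

CREATE TABLE Comment
(
    postId INT NOT NULL,
    commentId INT NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (postId, commentId)
) ENGINE=MyISAM;

-- Each post gets its own 1, 2, 3, … sequence:
INSERT INTO Comment (postId) VALUES (8452), (8452), (8453);
-- rows become (8452, 1), (8452, 2), (8453, 1)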
If commentId is unique, you can create a PRIMARY KEY on it and a secondary index on (postId, commentId):
CREATE TABLE Comment
(
    commentId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    postId INT NOT NULL,
    …,
    KEY (postId, commentId)
)
CommentId would be generated as postId*100+i (where i is the ith comment on the post)
thus in order to retrieve comments for a post (say with postId = 8452) I would search all comments with commentId (that would be the primary key) lying between 845200 and 845299, instead of searching all comments with postId = 8452 (of course this limits the maximum number of comments to 100). But will this lead to any performance gains?
This will likely give much worse performance than a query based on a postId foreign key column, but the only way to be sure is to try both techniques (as suggested by paxdiablo) and measure the performance.
I'm developing a multi-tenant SaaS app, and I've decided to use a single DB (MySQL InnoDB for now) for all clients' data. I chose to use composite primary keys like PK(client_id, id). I have two options here:
1: increment "id" myself (by trigger or from code)
or
2: make "id" AUTO_INCREMENT in MySQL.
In the first case I will have per-client ids, so each client will have ids 1, 2, 3, etc.
In the second case ids will grow across all clients.
What is the best practice here? My priorities are: performance, security & scaling. Thanks!
You definitely want to use auto-incrementing id values as primary keys. There are many reasons for this; here are some.
Avoiding race conditions (accidental id duplication) requires great care if you generate them yourself. Spend that mental energy -- development, QA, operations -- on making your SaaS excellent instead of reinventing the flat tire on primary keys.
You can still put an index on (client_id, id) even if it isn't the PK.
Your JOIN operations will be easier to write, test, and maintain.
This query pattern is great for getting the latest row for each client from a table. It performs very well. It's harder to do this kind of thing if you generate your own PKs.
SELECT t.*
FROM `table` t
JOIN (SELECT MAX(id) AS id
      FROM `table`
      GROUP BY client_id
     ) m ON t.id = m.id
"PK(client_id, id)" --
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(client_id, id),
INDEX(id)
Yes, that combination will work. And it will work efficiently. It will not assign 1,2,3 to each client, but that should not matter. Instead, consecutive ids will be scattered among the clients.
Probably all of your queries will include WHERE client_id = (constant), correct? That means that PRIMARY KEY(client_id, id) will always be used and INDEX(id) won't be used except for satisfying AUTO_INCREMENT.
Furthermore, that PK will be more efficient than having INDEX(client_id, id). (This is because InnoDB "clusters" the PK with the data.)
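Put together, a minimal sketch of that layout (the table name and payload column are hypothetical):

CREATE TABLE client_rows (
    client_id INT UNSIGNED NOT NULL,
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    payload VARCHAR(255),
    PRIMARY KEY(client_id, id),  -- clusters each client's rows together
    INDEX(id)                    -- id must lead some index to satisfy AUTO_INCREMENT
) ENGINE=InnoDB;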
I'm creating a table in a database that holds the different poll options. There is another table with the polls.
The idea is that, given a poll_id, I want to get all of its options as fast as possible.
These are the table columns: opt_id, poll_id, opt_text, opt_votes.
I would like opt_id not to be an AUTO_INCREMENT but just the id (1 to N options) within the poll, so to me the primary key is given by both poll_id and opt_id, right?
What I want is a proper index so that a query such as SELECT * FROM options WHERE poll_id=X takes as little time as possible, but I don't know whether setting the primary key on these two fields is enough or whether I have to set up an index somewhere.
For SELECT * FROM options WHERE poll_id=X, INDEX(poll_id) is optimal. If you already have a PRIMARY KEY starting with poll_id (such as PRIMARY KEY(poll_id, opt_id)), that is sufficient. (A PRIMARY KEY is a UNIQUE KEY, which is an INDEX.)
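A sketch using the columns from the question (types and lengths are assumptions):

CREATE TABLE options (
    poll_id INT NOT NULL,
    opt_id INT NOT NULL,              -- 1..N within each poll, assigned by the application
    opt_text VARCHAR(255) NOT NULL,
    opt_votes INT NOT NULL DEFAULT 0,
    PRIMARY KEY(poll_id, opt_id)      -- its leftmost column serves WHERE poll_id = X
) ENGINE=InnoDB;

-- Uses the primary/clustering index directly; no extra index needed:
SELECT * FROM options WHERE poll_id = 42;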
See the Index Cookbook.
Please provide SHOW CREATE TABLE; there is too much hand-waving in your description of the tables.
And show us any other SELECTs; they may need other indexes.
Sorry, I'm not sure the question title reflects the real question, but here goes:
I am designing a system which has a standard orders table, but with additional previous and next columns.
The question is which approach to the foreign keys is better.
Here I have a basic table with previous and next columns, which are self-referencing foreign keys. The problem with this table is that the first placed order doesn't have previous and next values, so they are left empty. If I have, say, 10,000 records, 30% of them having those columns empty is 3,000 rows, which is quite a lot I think, and I also expect the numbers to grow; in, let's say, a year it could come to 30,000 rows with empty columns, and I am not sure if that's OK.
The solution I have come up with is a main table plus two other tables which have foreign keys to that table. In this case those two additional tables are identifying tables and nothing more, and there are no longer rows with empty columns.
So the question is which solution is better considering query speed, table optimization, and common good practices, or maybe there's an even better one that I don't know? (P.S. I am using MySQL with the InnoDB engine.)
If your aim is to do order sets, you could simply add a new table for that, and just have a single column as a foreign key to that table in the order table.
The orders could also include a rank column to indicate the position of each order within its set.
create table order_sets (
    id int not null auto_increment,
    -- customer related data, etc...
    primary key(id)
);

create table orders (
    id int not null auto_increment,
    name varchar(255),
    quantity int,
    set_id int not null,
    set_rank int,
    primary key(id),
    foreign key (set_id) references order_sets(id)
);
Then inserting a new order means updating the rank of all other orders which come after in the same set, if any.
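For instance (the set id and rank values are made up for illustration):

-- Make room at rank 3 within set 7, then insert the new order there
update orders set set_rank = set_rank + 1 where set_id = 7 and set_rank >= 3;
insert into orders (name, quantity, set_id, set_rank) values ('widget', 2, 7, 3);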
Likewise, grouping queries become way easier than having to follow prev and next links. I'm pretty sure you will need these queries, and performance will be much better that way.
1 database with 3 tables: user - photo - vote
- A user can have many photos.
- A photo can have many votes.
- A user can vote on many photos.
- A vote records:
. the result as an int (-1/disliked, 0/neutral, 1/liked)
. the id of the user who voted.
Here is what I have (all FKs are cascade on delete and update):
http://grab.by/iZYE (sid = surrogate id)
My question is: this doesn't seem right, but I have been looking at it for two days already and can't confidently move on. How can I optimize this, or am I completely wrong?
MySQL/InnoDB tables are always clustered (more on clustering here and here).
Since the primary key also acts as the clustering key¹, using a surrogate primary key means you are physically sorting the table in an order that has no useful meaning for client applications and cannot be utilized for querying.
Furthermore, secondary indexes in clustered tables can be "fatter" than in heap-based tables and may require a double lookup.
For these reasons, you'd want to avoid surrogates and use more "natural" keys, similar to this:
({USER_ID, PICTURE_NO} in table VOTE references the same-named fields in PICTURE. The VOTE.VOTER_ID references USER.USER_ID. Use integers for *_ID and *_NO fields if you can.)
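A rough DDL sketch of the model just described (types are assumptions):

CREATE TABLE USER (
    USER_ID INT NOT NULL,
    PRIMARY KEY(USER_ID)
) ENGINE=InnoDB;

CREATE TABLE PICTURE (
    USER_ID INT NOT NULL,
    PICTURE_NO INT NOT NULL,   -- numbered per user
    PRIMARY KEY(USER_ID, PICTURE_NO),
    FOREIGN KEY(USER_ID) REFERENCES USER(USER_ID)
) ENGINE=InnoDB;

CREATE TABLE VOTE (
    USER_ID INT NOT NULL,      -- the picture's owner
    PICTURE_NO INT NOT NULL,
    VOTER_ID INT NOT NULL,
    VOTE_VALUE INT NOT NULL,   -- -1, 0 or 1
    PRIMARY KEY(USER_ID, PICTURE_NO, VOTER_ID),
    FOREIGN KEY(USER_ID, PICTURE_NO) REFERENCES PICTURE(USER_ID, PICTURE_NO),
    FOREIGN KEY(VOTER_ID) REFERENCES USER(USER_ID)
) ENGINE=InnoDB;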
This physical model will enable extremely efficient querying for:
Pictures of the given user (a simple range scan on PICTURE primary/clustering index).
Votes on the given picture (a simple range scan on VOTE primary/clustering index). Depending on circumstances, this may actually be fast enough so you don't have to cache the sum in PICTURE.
If you need votes of the given user, change the VOTE PK to: {VOTER_ID, USER_ID, PICTURE_NO}. If you need both (votes of picture and votes of user), keep the existing PK, but create a covering index on {VOTER_ID, USER_ID, PICTURE_NO, VOTE_VALUE}.
¹ In InnoDB. There are DBMSs (such as MS SQL Server) where the clustering key can differ from the primary key.
The first thing I see is that you have duplicate unique IDs on the tables. You don't need the sid columns; just use user_id, photo_id, and photo_user_id (maybe rename this one to vote_id). Those ID columns should also be INT type, definitely not VARCHARs. You probably don't need the vote total columns on photo; you can just run a query to get the total when you need it and not worry about keeping both tables in sync.
Assuming that you will only allow one vote per user on each photo, the structure of the vote table can be modified so the only columns are user_id, photo_id, and vote_result. You would then make the primary key a composite index on (user_id, photo_id). However, since you're using foreign keys, that makes this table a bit more complicated.
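A rough sketch of that shape (types are assumptions; the cascading foreign keys mirror the ones described in the question, and the user and photo tables are assumed to have user_id and photo_id as their primary keys):

CREATE TABLE vote (
    user_id INT NOT NULL,
    photo_id INT NOT NULL,
    vote_result TINYINT NOT NULL,   -- -1, 0 or 1
    PRIMARY KEY(user_id, photo_id),
    FOREIGN KEY(user_id) REFERENCES user(user_id) ON DELETE CASCADE ON UPDATE CASCADE,
    FOREIGN KEY(photo_id) REFERENCES photo(photo_id) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB;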
I'm getting all the likes of the current user and storing them in a table (user_id, liked_id). The problem is, when I fetch all the likes again and there is a change, I only want to insert the new likes. How can I do this efficiently, since many users have lots of likes?
Make (user_id, liked_id) the clustered primary key of the table. Use a fill factor for the index that leaves room for new pairs, and make sure your update statements can make efficient use of the clustered index (i.e. always include user_id in the WHERE clause).
Yes, you can make a composite/combined primary key, or make the combination of both fields unique.
Then rows that are already there will not be added again (they fail with a duplicate-key error), so only new data will be inserted.
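A minimal sketch of that approach (the table name is an assumption; INSERT IGNORE skips rows that would violate the key):

CREATE TABLE likes (
    user_id INT NOT NULL,
    liked_id INT NOT NULL,
    PRIMARY KEY(user_id, liked_id)
) ENGINE=InnoDB;

-- Re-inserting the full set is safe: duplicates are skipped, new pairs are added
INSERT IGNORE INTO likes (user_id, liked_id) VALUES (1, 42), (1, 43);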