In my project, I need to *calculate 'number_of_likes' for a particular comment*.
Currently, my comment_tbl table has the following structure:
id | user_id | comment_details
 1 |      10 | Test1
 2 |       5 | Test2
 3 |       7 | Test3
 4 |       8 | Test4
 5 |       3 | Test5
And I have another table, comment_likes_tbl, with the following structure:
id | comment_id | user_id
 1 |          1 |       1
 2 |          2 |       5
 3 |          2 |       7
 4 |          1 |       3
 5 |          3 |       5
The above is sample data.
Question:
On my live server there are around 50K records, and I calculate the *number_of_likes for a particular comment by joining the above two tables*.
I need to know: is that OK?
Or should I add one more field to the comment_tbl table to record the number_of_likes and increment it by 1 each time a comment is liked, in addition to inserting a row into comment_likes_tbl?
Would that help me in any way?
Thanks in advance.
Yes, you should add one more field, number_of_likes, to the comment_tbl table. It will avoid unnecessary joins.
This way you don't need a join unless you need to know who liked a comment.
A good example is the database design of Stack Overflow itself: the Users table has a Reputation field stored with it. Instead of joining and recalculating a user's reputation every time, they use this stored value.
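A minimal sketch of that approach, assuming the table and column names from the question (the specific id values are illustrative):

ALTER TABLE comment_tbl ADD COLUMN number_of_likes INT NOT NULL DEFAULT 0;

-- Record who liked the comment and bump the counter in one transaction
START TRANSACTION;
INSERT INTO comment_likes_tbl (comment_id, user_id) VALUES (1, 3);
UPDATE comment_tbl SET number_of_likes = number_of_likes + 1 WHERE id = 1;
COMMIT;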
You can take a few different approaches to something like this
As you're doing at the moment, run a JOIN query to return the collated results of comments and how many "likes" each has
As time goes on, you may find this is a drain on performance. Instead, you could simply keep a counter attached to each comment row and increment it. You may still find it useful to keep your *comment_likes_tbl* table, though, as it is a permanent record of who liked what, and when (otherwise you would have just a single figure with no additional metadata attached)
You could potentially also have a solution where you simply store your user's likes in the comment_likes_tbl, and then a cron task will run, on a pre-determined schedule, to automatically update all "like" counts across the board. Further down the line, with a busier site, this could potentially help even out performance, even if it does mean that "like" counts lag behind the real count slightly.
(On top of these, you can also implement caching solutions to keep temporary like counts attached to comments; MySQL also has useful query caching you can make use of.)
But what you're doing just now is absolutely fine, although you should still make sure you've set up your indexes correctly, otherwise you will notice performance degradation sooner. (A non-unique index on comment_id should suffice.)
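A minimal sketch of the JOIN approach and the index mentioned above, using the table names from the question:

-- One query returns every comment with its like count
SELECT c.id, c.comment_details, COUNT(cl.id) AS number_of_likes
FROM comment_tbl AS c
LEFT JOIN comment_likes_tbl AS cl ON cl.comment_id = c.id
GROUP BY c.id, c.comment_details;

-- Non-unique index so the join does not scan the whole likes table
CREATE INDEX idx_comment_likes_comment_id ON comment_likes_tbl (comment_id);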
Use the query. As they are foreign keys, the columns will be indexed and the query will be quick.
Yes, your architecture is good as it is and I would stick to it, for the moment.
Running too many joins can be a performance problem, but as long as you aren't actually facing such problems, you shouldn't worry about it.
Even if you do run into performance problems, you should first:
check that you use (foreign) keys, so that MySQL can look up the data very fast;
take advantage of the MySQL query cache;
use some sort of second caching layer, like memcached, to store the number of likes (as this is only an incrementing value).
Using memcached would solve your problem of running too many joins and avoid creating a column that isn't really necessary.
I want to split users' data into different tables so that there isn't one huge table containing all the data.
The problem is that, in the tables other than the main one, I can't tell which user each row belongs to.
Should I store the same user id in every table during signup? Doesn't that create unnecessary duplicates?
EDIT:
example
table:
| id | user | email | phone number | password | followers | following | likes | posts |
becomes
table 1:
| id | user | email | phone number | password |
table 2:
| id | followers num | following num | likes num | posts num |
This looks like an "XY problem".
You want to "not have a huge table". But why is it that you have this requirement?
Probably it's because some responses in some scenarios are slower than you expect.
Rather than split tables every which way, which as Gordon Linoff mentioned is a SQL antipattern and liable to leave you more in the lurch than before, you should monitor your system and measure the performance of the various queries you use, weighting them by frequency. That is, if query #1 is run one hundred thousand times per period and takes 0.2 seconds, that's 20,000 seconds you should chalk up to query #1. Query #2, which takes fifty times longer - ten full seconds - but is only run one hundred times, will only accrue one twentieth of the total time of the first.
(Since long delays are noticeable by end users, some use a variation of this formula in which you multiply the number of instances of a query by the square - or a higher power - of its duration in milliseconds. That way, slower queries receive more attention.)
Be that as it may, once you know which queries you should optimize first, then you can start optimizing your schema.
The first things to check are indexes, and maybe normalization. Those cover a good two thirds of the "low performing" cases I have met so far.
Then there's segmentation. Maybe not in your case, but you might have a table of transactions or such where you're usually only interested in the current calendar or fiscal year. Adding a column with that information will make the table larger, but selecting only those records that at minimum match a condition on the year will make most queries run much faster. This is supported at a lower level too (see "Sharding").
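A hypothetical illustration of that kind of segmentation (the table and column names are made up):

-- Add the year column once and index it
ALTER TABLE transactions ADD COLUMN fiscal_year SMALLINT NOT NULL DEFAULT 0;
CREATE INDEX idx_fiscal_year ON transactions (fiscal_year);

-- Most queries then only touch the slice for the current year
SELECT SUM(amount) FROM transactions WHERE fiscal_year = 2024 AND customer_id = 42;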
Then there are careless JOINs and sub-SELECTs. Usually they start small and fast, so no one bothers to check indexes, normalization or conditions on those. After a couple of years, the inner SELECT is gathering one million records, and the outer JOIN discards nine hundred and ninety-nine thousand of them. Move the discarding condition inside the subselect and watch the query take off.
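For instance, a sketch under made-up table and column names:

-- Careless version: the derived table aggregates every customer,
-- then the outer WHERE discards almost all of the rows
SELECT t.customer_id, t.total
FROM (SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id) AS t
WHERE t.customer_id = 42;

-- Pushing the condition inside the subselect reads far fewer rows
SELECT t.customer_id, t.total
FROM (SELECT customer_id, SUM(amount) AS total
      FROM orders
      WHERE customer_id = 42
      GROUP BY customer_id) AS t;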
Then you can check whether some information is really rarely accessed (for example, I have one DB where each user has a bunch of financial information, but this is only needed in maybe 0.1% of requests. So in that case yes, I have split that information in a secondary table, also gaining the possibility of supporting users with multiple bank accounts enrolled in the system. That was not why I did it, mind you).
In all this, also take into account time and money. Doing the analysis, running the modifications and checking them out, plus any downtime, is going to cost something and possibly even increase maintenance costs. Maybe - just maybe - throwing less money than that into a faster disk or more RAM or more or faster CPUs might achieve the same improvements without any need to alter either the schema or your code base.
I think you want to use a LEFT JOIN:
SELECT t1.`user`, t2.`posts`
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t1.id = t2.id
EDIT: Here is a link to documentation that explains different types of JOINS
I believe I understand your question: you can use a foreign key. When you have a list of users, make sure that each user has a specific id.
Later, when you insert data about a user, you can insert the user's id via a session variable or a GET request (insert into the other table).
Then, when you need to pull data for that specific user from those other tables, you can just select from the table where id = session[id] or get[id].
Does that help?
Answer: use a foreign key to identify users' data, using GETs and sessions.
Don't worry about duplicates if you are removing those values from the main table.
One table would probably have an AUTO_INCREMENT for the PRIMARY KEY; the other table would have the identical PK, but it would not be AUTO_INCREMENT. JOINing the tables will put the tables "back together" for querying.
There is rarely a good reason to "vertically partition" a table. One rare case is to split out the "like_count" or "view_count". This way the main table would not be bothered by the incessant UPDATEing of the counters. In some extreme cases, this may help performance.
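A minimal sketch of that layout, using hypothetical names based on the tables in the question:

-- Main table: the PRIMARY KEY is AUTO_INCREMENT
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    `user` VARCHAR(50),
    email VARCHAR(100),
    phone_number VARCHAR(30),
    password VARCHAR(255)
) ENGINE=InnoDB;

-- Counter table: identical PK values, but not AUTO_INCREMENT
CREATE TABLE user_counters (
    id INT PRIMARY KEY,
    followers_num INT NOT NULL DEFAULT 0,
    following_num INT NOT NULL DEFAULT 0,
    likes_num INT NOT NULL DEFAULT 0,
    posts_num INT NOT NULL DEFAULT 0,
    FOREIGN KEY (id) REFERENCES users (id)
) ENGINE=InnoDB;

-- The JOIN puts the two halves "back together"
SELECT u.`user`, c.posts_num
FROM users AS u
JOIN user_counters AS c ON c.id = u.id;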
I have a forum where I have properties like
follow, voteup, votedown, report, favorite, view, etc. for each thread, answer and comment.
Which approach will be faster and better performance-wise?
I am expecting billions of favorites, views, etc., just like YouTube.
Approach One
Make one big counter table:
counter_id | user_id | object_id | object_type | property
where object_type = thread, comment or answer, with the respective id from the tables threads, comments and answers,
and property = follow, voteup, votedown, report, etc.
Approach Two
Make individual tables for follow, views, report, etc.:
views
view_id | user_id | object_id | object_type
follows
follow_id | user_id | object_id | object_type
There is no single answer to this; it's quite subjective.
Most commonly it's best to consider the use cases for your design. Think carefully about what these fields will be used for before you add them to any table. And don't think that you have to add a numeric primary key ("ID") to every table. A table for tracking follows is just fine with only the fields user id | object id | object type and all three fields contained in the primary key.
It's unlikely your code will ever be used under performance constraints like those of YouTube or even Stack Overflow. If it is, you will most likely have remodelled the database by then.
However for the sake of the exercise consider where and how data is to be used...
I would have separate tables as follows
Follow
User feeds probably need their own table, as most commonly they get hit from anywhere (a bit like a global inbox). The follow should also have some flag or timestamp to mark changes, so that it's very easy to evaluate what has changed since the last time the user was online.
This is because a user needs to see what they've followed as some sort of feed, and others need to see how many people have followed. But others don't need to see who else has followed.
Vote up, Vote down
That's just a vote with a +/- flag. Do denormalize this... That is, store BOTH a user's individual votes in a table AND a count of votes in a field on the object's own table. That way you only ever check a single user's vote (their own) for a page view. The counts are retrieved from the same row containing the content.
Again: a user needs to see what they've up/down voted. You need to check they're not voting twice. What matters is the final count. So checking an object with a million upvotes should not have to hit a million rows - just one.
Pro tip: some database engines perform badly if you constantly update rows with large content. So consider a "meta-data" table for all objects, which stores counts such as these. That leaves the meta-data free to update frequently even if the content doesn't.
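A hypothetical sketch of such a meta-data table and the denormalized vote update (all names are illustrative, and a votes table holding the individual votes is assumed to exist):

-- Counts live in a small, frequently updated table, away from the large content rows
CREATE TABLE object_counters (
    object_id INT NOT NULL,
    object_type TINYINT NOT NULL,
    vote_total INT NOT NULL DEFAULT 0,
    favourite_count INT NOT NULL DEFAULT 0,
    view_count INT NOT NULL DEFAULT 0,
    PRIMARY KEY (object_id, object_type)
) ENGINE=InnoDB;

-- Store the individual vote, then bump the denormalized total
INSERT INTO votes (user_id, object_id, object_type, vote) VALUES (7, 42, 1, 1);
UPDATE object_counters
SET vote_total = vote_total + 1
WHERE object_id = 42 AND object_type = 1;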
Favorite
Own table again: user id | object id | object type. If you want to display the number of favourites to the public, keep a count of it against the object; don't do a SELECT COUNT(*) on every page view.
View
Why even store this? Keep a count against the object. If you're going to store a history, make sure you put a timestamp against it and purge it regularly; you don't need to store what a user was looking at six months ago.
As a general observation all of these are separate tables with the exception of up and down votes.
You should denormalize the counts to reduce the quantity of data your server needs to access to determine a page view. Most commonly a page view should be the fastest thing. Any form of update can be a little slower.
Where I mention for favourites and the others that they don't need an additional primary key field, what I mean is that they do have a primary key, just not an extra surrogate field. For example, favourites could be:
CREATE TABLE favourites (
    user INT,
    object_type INT,
    object_id INT,
    PRIMARY KEY (user, object_type, object_id)
)
There's simply no reason to have a favorite_id field.
Answer, Part 1: Plan on redesigning as you go.
The best advice I can give you is to plan for change. What you design for the first million will not work for 30 million. The 30-million design will not survive to a billion. Whatever you do after reading this thread may last you through 30K rows.
Why is this? Well, partially because you will not be able to do it in a single machine. Don't shard your database now, but keep in the back of your mind that you will need to shard it. At that point, much of what worked on a single machine will either not work on multiple machines, or will be too slow to work. So you will have to redesign.
Let me point out another aspect of 1 billion rows. Think how fast you have to do INSERTs to grow a table to 1B rows in 1 year. It's over 30 per second. That's not bad, until you factor in the spikes you will get.
And what will happen when your second billion won't fit on the disk you have laid out?
Anyone who grows to a billion rows has to learn as he goes. The textbooks don't go there; the manuals don't go there; only the salesmen go there, but they don't stick around after the check clears. Look at YouTube (etc) -- almost nothing is "off the shelf".
And think of how many smart designers you will need to hire to get to 1 billion.
It is painful to add a column to a billion-row table, so (1) plan ahead, and (2) design a way to make changes without major outages.
Answer, Part 2: Some tips
Here are some of my comments on the ideas bounced around, and some tips from someone who has dealt with a billion-row, sharded system (not YouTube, but something similar).
Normalize vs denormalize: My motto: "Normalize, but don't overnormalize." You'll see what I mean after you have done some of it.
One table vs many: two tables with essentially identical CREATE TABLEs should usually be a single table. (Sharding, of course, violates that.) OTOH, if you need thousands of UPDATE ... view_count = view_count + 1 per second, it won't survive to a billion. However, it might survive to a million; then plan for change.
Minimize the size of datatypes -- using a MEDIUMINT instead of an INT for one column saves one byte per row, which is a gigabyte over a billion rows.
Do not paginate using OFFSET and LIMIT. (I have a blog on a workaround.)
Batch INSERTs where possible (a small example follows this list).
Use InnoDB; you don't want to wait hours for a REPAIR to finish on a MyISAM table.
The simple task of getting a unique ID for the 'next' item can be a huge problem in a sharded system. Wait until you are closer to needing sharding before redesigning that part. Do not use UUIDs for a billion-row table; they will perform poorly. So don't even think about UUIDs now; you would have to throw them away.
Long before you hit 1 billion, you will have nightmares about the one machine crashing. Think about replication, HA, etc, early. It is painful to set up such after you have big tables.
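A tiny illustration of the batched INSERT tip, against a hypothetical views table:

-- One round trip and one statement instead of three
INSERT INTO views (user_id, object_id, object_type)
VALUES (1, 10, 1), (2, 10, 1), (3, 11, 2);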
I was having an argument with a friend of mine. Suppose we have a db table with a userid and some other fields. This table might have a lot of rows. Let's suppose also that by design we limit the records for each userid in the table to about 50. My friend suggested that if I put all the rows for each userid one after another, the lookup would be faster, e.g.
userid otherfield
1 .........
1 .........
.....until 50...
2 ........
etc. So when user id 1 is created, I pre-populate its 50 rows in the table with null values, etc. The idea is that if I know the number of rows and find the first row with userid = 1, I just have to look at the next 49, and voilà, I don't have to search the whole table. Is this correct? Can this be done without indexing? Is the pre-population an expensive process? Is there a performance difference compared to just inserting in the old-fashioned way, like
1 ........
2 ........
2 ........
1 ........
etc?
To answer a performance question like this, you should run performance tests on the different configurations.
But, let me make a few points.
First, although you might know that the records for a given id are located next to each other, the database does not know this. So, if you are searching for one user -- without an index -- then the engine needs to search through all the records (unless you have a limit clause in the query).
Second, if the data is fixed length (numbers and dates), then populating it with real values after populating it with NULLs will occupy the same space on the page. But if the data is variable length, a given page will initially be filled with empty records, and when you update those records with real values you will get page splits.
What you are trying to do is to outsmart the database engine. This isn't necessary, because MySQL provides indexes, which provide almost all the benefits that you are describing.
Now, having said that, there is some performance benefit to having all the records for a user co-located. If a user has 50 records, then reading them via an index would typically require loading up to 50 pages into memory; if the records are co-located, only one or two pages would need to be read. Typically this is a very small gain, because the most frequently accessed tables fit into memory. There might be some circumstances where the performance gain is worth it.
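A minimal sketch of the index-based alternative, with hypothetical table and column names:

-- With an index on userid, the engine jumps straight to the ~50 matching rows,
-- wherever they physically live, without scanning the whole table
CREATE INDEX idx_userid ON user_data (userid);

SELECT * FROM user_data WHERE userid = 1;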
I just insert data with a form on my website; normally the data is inserted after the last of the rows, like:
auto_increment name
1 a
2 b
3 c
4 d
5 e
But the last time I inserted new data, it was inserted in the middle rows of the table, which looked like:
17 data17
30 data30
18 data18
19 data19
20 data20
The newest data was inserted in the middle rows of the table (data30).
This happens to me rarely (but it still happens). Why does it happen, and how do I prevent it in the future? Thank you.
What you see is the order of results returned by the engine. It hardly matters which record is fetched earlier and which later, as that depends on a lot of factors. For one, don't think of your database table as a sequential file like FoxPro; it is way more sophisticated than that. Next, for every query that returns data, use an ORDER BY clause to avoid these surprises.
So always use:
SELECT columns FROM table ORDER BY column
The above will ensure you get the data in the order you need, and you won't be surprised when the DB engine finds a later record in cache while fetching an older record from slower media in another database file. If you read up on basic RDBMS concepts, these things are discussed there, as is how MySQL works internally.
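Applied to the table in the question (assuming the auto-increment column is called id and the table is called your_table), that would be, for example:

SELECT id, name
FROM your_table
ORDER BY id;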
I found this great article that discusses the many wonderful features of a modern database query engine.
http://thinkingmonster.wordpress.com/mysql/mysql-architecture/
The entire article covers the topic very well, but pay extra attention to the section that talks about the record cache.
Let's say I would like to store votes to polls in mysql database.
As far as I know I have two options:
1. Create one table (let's say votes) with fields like poll_id, user_id, selected_option_id, vote_date and so on..
2. Create a new database for votes (let's say votes_base) and, for each poll, add a table to this database whose name contains the id of the poll, e.g. poll_[id of the poll].
The problem with the first option is that the table will become big very soon. Let's say I have 1000 polls and each poll has 1000 votes - that's already a million records in the table. I don't know how much that will cost in performance.
The problem with the second option is that I'm not sure it is the correct solution from the point of view of good programming practice. But I'm fairly sure that with this option it will be (much?) faster to find all the votes for a given poll.
Or maybe there is a better option?
Your first option is the better option. It is structurally sounder. Millions of rows in a table are no problem for MySQL. A new table per poll is an antipattern.
EDIT for first comment:
Even with a billion or more votes, MySQL should cope. Indexes are the key here. What is the difference between one database with 100 copies of the same table and one table with 100 times the rows?
Technically, the second option works as well. Sometimes it might be even better. But we frequently see this:
Instead of one table, users, with 10 columns
Make 100 tables, users_uk, users_us, ... depending on where the users are from.
Great, no? Works, yes? Well it does, until you want to select all the male users, or join the users table onto another table. You'll have a huge UNION coming, and you won't even know the tables beforehand.
One big users table, with the appropriate indexes, is better. If it gets too big for your liking (or your disk), you can start with PARTITIONING: you still have the benefit of one table, but the partitions are stored in different locations.
Now, with your polls, these kinds of queries might not happen. In that case, one big InnoDB table or thousands of small tables might both work... but the first option is a lot easier to program and has no drawbacks compared to the second. Why choose the second option?
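A hypothetical sketch of the PARTITIONING idea mentioned above (the table, columns and partition count are illustrative):

-- One big users table, split into partitions by a region code;
-- every unique key must include the partitioning column, so it is part of the PK
CREATE TABLE users (
    id INT NOT NULL AUTO_INCREMENT,
    country_code CHAR(2) NOT NULL,
    name VARCHAR(100),
    PRIMARY KEY (id, country_code)
) ENGINE=InnoDB
PARTITION BY KEY (country_code) PARTITIONS 8;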
The first option is the better one, no doubt. Just be sure to define INDEXes on the fields you will use to search the data (such as poll_id, for sure) and you will not experience performance issues. MySQL is a DBMS perfectly capable of handling that number of rows. Do not worry.
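For instance, a minimal sketch of the votes table from the first option, with the index built into the primary key (column types are assumptions):

CREATE TABLE votes (
    poll_id INT NOT NULL,
    user_id INT NOT NULL,
    selected_option_id INT NOT NULL,
    vote_date DATETIME NOT NULL,
    PRIMARY KEY (poll_id, user_id)  -- one vote per user per poll; also serves as the index on poll_id
) ENGINE=InnoDB;

-- Fetching all votes for one poll uses the primary key prefix
SELECT user_id, selected_option_id FROM votes WHERE poll_id = 123;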
The first option is better. And you can archive old data after a while if you're not going to use it often.