Lots of small MySQL tables or one big table - MySQL

I have a forum where I have properties like
follow, voteup, votedown, report, favorite, view, etc. for each thread, answer and comment.
Which approach will be faster and better performance-wise?
I am expecting billions of favorites, views, etc., just like YouTube.
Approach One
Make one big table counter
counter_id | user_id | object_id | object_type | property
where object_type = thread, comment or answer (with their respective ids from the threads, comments and answers tables)
and property = follow, voteup, votedown, report, etc.
Approach Two
Make individual tables for follow, views, report, etc.
views
view_id | user_id | object_id | object_type
follows
follow_id | user_id | object_id | object_type

There is no single answer to this; it's quite subjective.
Most commonly it's best to consider the use cases for your design. Think carefully about what these fields will be used for before you add them to any table. And don't think that you have to add a numeric primary key ("ID") to every table. A table for tracking follows is just fine with only the fields user id | object id | object type and all three fields contained in the primary key.
It's unlikely your code will ever be used under the same performance constraints as YouTube or even Stack Overflow. If it is, you will most likely have remodelled the database by then.
However, for the sake of the exercise, consider where and how the data is to be used...
I would have separate tables as follows
Follow
User feeds probably need their own table, as they most commonly get hit from anywhere (a bit like a global inbox). The follow should also have some flag or timestamp to mark changes, so that it's very easy to evaluate what has changed since the last time the user was online.
This is because a user needs to see what they've followed as some sort of feed, and others need to see how many people have followed. But others don't need to see who else has followed.
Vote up, Vote down
That's just a vote with a +/- flag. Do denormalize this... That is, store BOTH each user's individual votes in a table AND a count of votes in a field on the object's own table. That way you only ever check a single user's vote (their own) for a page view; the counts are retrieved from the same row that contains the content.
Again: a user needs to see what they've up/down voted, and you need to check they're not voting twice, but what matters is the final count. So displaying an object with a million up-votes should not have to hit a million rows - just one.
Pro tip: Some database engines perform badly if you constantly update rows with large content. So consider a "meta-data" table for all objects, which stores counts such as these. This leaves the meta-data free to update frequently even when the content doesn't.
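To make that concrete, here is a minimal sketch under the assumptions above; the table names (votes, object_meta), the numeric object_type codes and the column types are just illustrations, not part of the original design:
CREATE TABLE votes (
    user_id     INT UNSIGNED NOT NULL,
    object_type TINYINT UNSIGNED NOT NULL,   -- e.g. 1 = thread, 2 = answer, 3 = comment
    object_id   INT UNSIGNED NOT NULL,
    vote        TINYINT NOT NULL,            -- +1 = up, -1 = down
    PRIMARY KEY (user_id, object_type, object_id)   -- also blocks a second vote from the same user
) ENGINE=InnoDB;

CREATE TABLE object_meta (
    object_type TINYINT UNSIGNED NOT NULL,
    object_id   INT UNSIGNED NOT NULL,
    up_votes    INT UNSIGNED NOT NULL DEFAULT 0,
    down_votes  INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (object_type, object_id)
) ENGINE=InnoDB;

-- Recording an up-vote touches one row in each table; displaying the count reads one row.
INSERT INTO votes (user_id, object_type, object_id, vote) VALUES (42, 1, 30320, 1);
UPDATE object_meta SET up_votes = up_votes + 1 WHERE object_type = 1 AND object_id = 30320;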
Favorite
Own table again: user id | object id | object type. If you want to display the number of favourites to the public, keep a count against the object; don't do a SELECT COUNT(*) on every page view.
View
Why even store this? Keep a count against the object. If you're going to store a history then make sure you put a timestamp against it and purge it regularly. You don't need to store what a user was looking at six months ago.
As a general observation all of these are separate tables with the exception of up and down votes.
You should denormalize the counts to reduce the quantity of data your server needs to access to determine a page view. Most commonly a page view should be the fastest thing. Any form of update can be a little slower.
Where I mention that favourites and the others don't need an additional primary key field, what I mean is that they still have a primary key, just not an additional ID field. For example, favourites could be:
CREATE TABLE favourites (
    user INT,
    object_type INT,
    object_id INT,
    PRIMARY KEY (user, object_type, object_id)
);
There's simply no reason to have a favorite_id field.

Answer, Part 1: Plan on redesigning as you go.
The best advice I can give you is to plan for change. What you design for the first million will not work for 30 million. The 30-million design will not survive to a billion. Whatever you do after reading this thread may last you through 30K rows.
Why is this? Well, partially because you will not be able to do it in a single machine. Don't shard your database now, but keep in the back of your mind that you will need to shard it. At that point, much of what worked on a single machine will either not work on multiple machines, or will be too slow to work. So you will have to redesign.
Let me point out another aspect of 1 billion rows. Think how fast you have to do INSERTs to grow a table to 1B rows in 1 year. It's over 30 per second. That's not bad, until you factor in the spikes you will get.
And what will happen when your second billion won't fit on the disk you have laid out?
Anyone who grows to a billion rows has to learn as he goes. The textbooks don't go there; the manuals don't go there; only the salesmen go there, but they don't stick around after the check clears. Look at YouTube (etc) -- almost nothing is "off the shelf".
And think of how many smart designers you will need to hire to get to 1 billion.
It is painful to add a column to a billion-row table, so (1) plan ahead, and (2) design a way to make changes without major outages.
Answer, Part 2: Some tips
Here are some of my comments on the ideas bounced around, and some tips from someone who has dealt with a billion-row, sharded system (not YouTube, but something similar).
Normalize vs denormalize: My motto: "Normalize, but don't overnormalize." You'll see what I mean after you have done some of it.
One table vs many: Two tables with essentially identical CREATE TABLEs should usually be a single table. (Sharding, of course, violates that.) OTOH, if you need thousands of UPDATE ... SET view_count = view_count + 1 statements per second, it won't survive to a billion. However, it might survive to a million; then plan for change.
Minimize the size of datatypes -- using a MEDIUMINT (3 bytes) instead of an INT (4 bytes) for one column saves a gigabyte across a billion rows.
Do not paginate using OFFSET and LIMIT; "remember where you left off" instead -- see the sketch after these tips. (I have a blog on a workaround.)
Batch INSERTs where possible.
Use InnoDB, you don't want to wait hours for a REPAIR to finish on a MyISAM table.
The simple task of getting a unique ID for the 'next' item can be a huge problem in a sharded system. Wait until you are closer to needing sharding before redesigning that part. Do not use UUIDs for a billion-row table; they will perform poorly. So, don't even think about UUIDs now; you will have to throw them away.
Long before you hit 1 billion, you will have nightmares about the one machine crashing. Think about replication, HA, etc, early. It is painful to set up such after you have big tables.
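As a sketch of the "remember where you left off" pagination mentioned in the tips above (the comments table and its columns here are hypothetical):
-- OFFSET forces MySQL to read and discard every skipped row; anchoring on the
-- last id already shown keeps every page equally cheap.
SELECT comment_id, body
FROM comments
WHERE comment_id > 123456      -- the last id shown on the previous page
ORDER BY comment_id
LIMIT 20;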

Related

Divide SQL data in different tables

I want to split users' data into different tables so that there isn't a huge one containing all the data...
The problem is that in tables other than the main one I can't recognize who each piece of data belongs to.
Should I store the same user id in every table at signup? Doesn't that create unnecessary duplicates?
EDIT:
example
table:
| id | user | email | phone number| password | followers | following | likes | posts |
becomes
table 1:
| id | user | email | phone number| password |
table 2:
| id | followers num | following num | likes num | posts num |
This looks like a "XY problem".
You want to "not have a huge table". But why is it that you have this requirement?
Probably it's because some responses in some scenarios are slower than you expect.
Rather than splitting tables every which way, which as Gordon Linoff mentioned is a SQL antipattern and liable to leave you more in the lurch than before, you should monitor your system and measure the performance of the various queries you use, weighting them by frequency. That is, if query #1 is run one hundred thousand times per period and takes 0.2 seconds, that's 20,000 seconds you should chalk up to query #1. Query #2, which takes fifty times longer - ten full seconds - but is only run one hundred times, will only accrue one twentieth of the total time of the first (1,000 seconds).
(Since long delays are more noticeable to end users, some use a variation of this formula in which you multiply the number of executions of a query by the square - or a higher power - of its duration in milliseconds. This way, slower queries attract more attention.)
Be that as it may, once you know which queries to optimize first, you can start optimizing your schema.
The first things to check are indexes, and maybe normalization. Those cover a good two thirds of the "low performing" cases I have met so far.
Then there's segmentation. Maybe not in your case, but you might have a table of transactions or similar where you're usually only interested in the current calendar or fiscal year. Adding a column with that information will make the table larger, but selecting only those records that at minimum match a condition on the year will make most queries run much faster. This is also supported at a lower level (see "sharding").
Then there are careless JOINs and sub-SELECTs. Usually they start small and fast, so no one bothers to check indexes, normalization or conditions on them. After a couple of years, the inner SELECT is gathering one million records, and the outer JOIN discards nine hundred and ninety-nine thousand of them. Move the discarding condition inside the subselect and watch the query take off.
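A hedged before/after illustration of that last point, with made-up table and column names (accounts, transactions, booked_year):
-- Before: the derived table aggregates every year, then the outer WHERE throws most of it away.
SELECT a.name, t.total
FROM accounts a
JOIN ( SELECT account_id, booked_year, SUM(amount) AS total
       FROM transactions
       GROUP BY account_id, booked_year ) t ON t.account_id = a.id
WHERE t.booked_year = 2023;

-- After: the discarding condition lives inside the subselect, so only the rows you keep are aggregated.
SELECT a.name, t.total
FROM accounts a
JOIN ( SELECT account_id, SUM(amount) AS total
       FROM transactions
       WHERE booked_year = 2023
       GROUP BY account_id ) t ON t.account_id = a.id;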
Then you can check whether some information is really rarely accessed (for example, I have one DB where each user has a bunch of financial information, but it is only needed in maybe 0.1% of requests. So in that case, yes, I have split that information into a secondary table, also gaining the possibility of supporting users with multiple bank accounts enrolled in the system. That was not why I did it, mind you).
In all this, also take into account time and money. Doing the analysis, running the modifications and checking them out, plus any downtime, is going to cost something and possibly even increase maintenance costs. Maybe - just maybe - throwing less money than that into a faster disk or more RAM or more or faster CPUs might achieve the same improvements without any need to alter either the schema or your code base.
I think you want to use a LEFT JOIN
SELECT t1.`user`, t2.`posts`
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t1.id = t2.id;
EDIT: Here is a link to documentation that explains different types of JOINS
I believe I understand your question, and if you are wondering, you can use a foreign key. When you have a list of users, make sure that each user has a specific id.
Later, when you insert data about a user, you can insert the user's id via a session variable or a GET request (insert into the other table).
Then, when you need to pull data for that specific user from that other table (or tables), you can just SELECT ... FROM the table WHERE id = session[id] or get[id].
Does that help?
Answer: use a foreign key to identify a user's data, using GETs and sessions.
Don't worry about duplicates if you are removing those values from the main table.
One table would probably have an AUTO_INCREMENT for the PRIMARY KEY; the other table would have the identical PK, but it would not be AUTO_INCREMENT. JOINing the tables will put the tables "back together" for querying.
There is rarely a good reason to "vertically partition" a table. One rare case is to split out the "like_count" or "view_count". This way the main table would not be bothered by the incessant UPDATEing of the counters. In some extreme cases, this may help performance.
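For illustration, a possible shape of such a vertical split; posts and post_counts are hypothetical names, not tables from the question:
CREATE TABLE posts (
    post_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(200) NOT NULL,
    body    TEXT NOT NULL
) ENGINE=InnoDB;

CREATE TABLE post_counts (
    post_id    INT UNSIGNED PRIMARY KEY,     -- same value as posts.post_id, but not AUTO_INCREMENT
    view_count INT UNSIGNED NOT NULL DEFAULT 0,
    like_count INT UNSIGNED NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- The incessant counter UPDATEs touch only the small table...
UPDATE post_counts SET view_count = view_count + 1 WHERE post_id = 42;

-- ...and a JOIN puts the row "back together" when the content is displayed.
SELECT p.title, p.body, c.view_count, c.like_count
FROM posts p
JOIN post_counts c ON c.post_id = p.post_id
WHERE p.post_id = 42;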

Optimizing a mysql query to fetch "unseen" entries per user

This title is rather convoluted, but I couldn't come up with something clearer.
Long story short, we're creating a mobile app connected to a Node.js server communicating with a MySQL database. Pretty common setup. Now, we have multiple connected users who are able to upload "moments" to our servers. Each moment can be seen only once by each other user.
As soon as a user x sees another user y's moment, x can never see that moment of y's again. Maybe a bit like Snapchat, except a moment goes from a single user to multiple users instead of one to one. Moments are also ordered by distance according to the current user's location.
Now, I'm looking for an intelligent way of only fetching the "unseen" moments from database. For now, we're using a relational table between Users and Moments.
Let's say a user (ID = 20) sees a moment (ID = 30320), then we insert into this table 20 and 30320. I know. This is hardly scalable and probably a terrible idea.
I thought about maybe checking the last seen date and only fetching moments that are past this date, but again, moments are ordered by distance before being ordered by date so it is possible to see a moment that is 3 minutes old followed by a moment that is 30 seconds old.
Is there a more clever way of doing this, or am I doomed to use a relationship table between Moments and Users, and join to it when querying?
Thanks a lot.
EDIT -
This logic uses in total 3 tables.
Users
Moments
MomentSeen
MomentSeen only contains which user has seen which moment, and when. Since the moments aren't ordered by date, I can't simply fetch all the moments that were uploaded after the last seen moment.
EDIT -
I just realized the mobile app Tinder must use similar logic for which user "liked" which other user. Since you can't go back in time and see a user twice, they probably use a very similar query as what I'm looking for.
Considering they have a lot of users, and that they're ordered by distance and some other unknown criteria, there must be a more clever way of doing things than a "UserSawUser" relational table.
EDIT
I can't provide the entire database structure so I'll just leave the important tables and some of their fields.
Users {
UserID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
}
Moments {
MomentID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
UploaderID INT UNSIGNED, /* FK to UserID */
TimeUploaded DATE /* usually NOW() at insertion */
}
MomentSeen {
/* Both are FK to Users and Moments */
MomentID INT UNSIGNED,
UserID INT UNSIGNED
}
You could consider implementing a Bloom filter. It is widely used to reduce disk seeks and improve performance.
Medium is using it to check if a user has read a post already.
More details here-
https://medium.com/the-story/what-are-bloom-filters-1ec2a50c68ff
https://en.wikipedia.org/wiki/Bloom_filter
Do not use one table per user. Do have a single table for the moments.
You seem to have two conflicting orderings for "moments": 'distance' and 'unseen'; which is it?
If it is 'unseen', are the 'moments' numbered chronologically? This implies that each user has a last_moment_seen -- all Moments before then have been seen; all after that have not been seen. So...
SELECT ...
    FROM Moments
    WHERE MomentID > ( SELECT last_moment_seen
                       FROM Users
                       WHERE UserID = $user_id );
would get all the moments not yet seen for a given user.
Munch on that for a while; then come back for more suggestions.
Edit
This should give you the Moments not yet seen. You can then order them as you see fit.
SELECT m....
    FROM Moments m
    LEFT JOIN MomentSeen ms ON ms.MomentID = m.MomentID
                           AND ms.UserID = $user_id   -- restrict the "seen" rows to the current user
    WHERE ms.MomentID IS NULL
    ORDER BY ...
    LIMIT 1 -- if desired
Why hesitate to use a join?
Have you tried filling your database with dummy data - millions of rows - to measure the performance impact on your system?
Using joins is not such a bad idea and is often faster than a single table, if done right.
You should probably do some research on database design for reference.
For instance, ordering a table is done using an index.
However, you can use more than one index on a table, and each index can combine several columns.
This can be done by analyzing the queries most often run against the table.
A handy recipe is to create an index containing the columns used as join keys, another index for each common "WHERE" parameter combination, and an index on the column(s) of each "ORDER BY" run against that table (ascending/descending order does matter).
So don't be shy about adding another column or index to suit your needs.
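Applied to the tables in this question, that recipe might look like the statements below; the assumption (not stated in the question) is that MomentSeen has no primary key yet and that the main query filters by UserID, joins on MomentID and orders by TimeUploaded:
-- Composite key covering both the WHERE on UserID and the join on MomentID.
ALTER TABLE MomentSeen ADD PRIMARY KEY (UserID, MomentID);

-- Separate index supporting ORDER BY TimeUploaded on the joined table.
ALTER TABLE Moments ADD INDEX idx_time_uploaded (TimeUploaded);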
If you are talking about scalability, then you should consider tuning the database engine,
e.g. allowing for a larger maximum key value by using BIGINT.
Using a database cluster would also require in-depth analysis, because auto-increment keys have issues in a multi-master setup.
If you want to squeeze more performance from your system, you should consider designing the whole database to be partition-friendly from the very start. That will include serious analysis of your business rules. Creating a partition-friendly environment requires choosing a set of columns as the partitioning key and splitting the data physically (remember to set innodb_file_per_table = 1 in the MySQL config, otherwise the benefit of table partitioning is lost).
If not done right, however, partitioning will not bring you any benefit.
https://dev.mysql.com/doc/refman/5.5/en/partitioning.html
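As a rough illustration only - the yearly split is an assumption, not something the question requires - a range-partitioned version of the Moments table could look like this:
CREATE TABLE Moments (
    MomentID     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    UploaderID   INT UNSIGNED NOT NULL,
    TimeUploaded DATE NOT NULL,
    PRIMARY KEY (MomentID, TimeUploaded)   -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE ( YEAR(TimeUploaded) ) (
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION p2016 VALUES LESS THAN (2017),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);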

MySQL performance; large data table or multiple data tables?

I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, what would be better for performance of the following:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, and then a table of all phone numbers, etc., and then use a SELECT statement that has multiple joins?
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, address, city, state, zip (meaning each member will have at least 1 of each of those but can have multiple (normally 1-3 per member), so in excess of 75000 phone numbers, email addresses, etc.)
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database, e.g. taking a CSV file of 10k rows of sales and contact data and querying the database to work out which member each sales row from the CSV belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice).
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns or even single columns and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
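For contrast, the structure recommended above might look roughly like this; all names are illustrative, and only a genuinely repeating item (email) gets its own child table:
CREATE TABLE members (
    member_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name  VARCHAR(50) NOT NULL,
    company    VARCHAR(100),
    city       VARCHAR(50),
    state      CHAR(2),
    zip        VARCHAR(10)
) ENGINE=InnoDB;

CREATE TABLE member_emails (
    member_id INT UNSIGNED NOT NULL,
    email     VARCHAR(255) NOT NULL,
    PRIMARY KEY (member_id, email),
    KEY idx_email (email),                 -- lets the lookup below use an index
    FOREIGN KEY (member_id) REFERENCES members (member_id)
) ENGINE=InnoDB;

-- Matching an incoming sales row by email is then a single indexed lookup.
SELECT member_id FROM member_emails WHERE email = 'test@domain.com';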
You should research scaling out vs scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
The point is that though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to stay truly declarative rather than programmatic about which object you're addressing/querying. It is important to distinguish sharding from partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither as-is; a variant of the first option is the right approach.
Create a 'lookup' table that will store values of data type (mail, phone etc...). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.

Having a column 'number_of_likes' or have a separate column...?

In my project, I need to calculate the 'number_of_likes' for a particular comment.
Currently I have following structure of my comment_tbl table:
id user_id comment_details
1 10 Test1
2 5 Test2
3 7 Test3
4 8 Test4
5 3 Test5
And I have another table 'comment_likes_tbl' with following structure:
id comment_id user_id
1 1 1
2 2 5
3 2 7
4 1 3
5 3 5
The above are sample data.
Question :
On my live server there are around 50K records, and I calculate the number_of_likes for a particular comment by joining the two tables above.
I need to know: is that OK?
Or should I add one more field to the comment_tbl table to record the number_of_likes, incrementing it by 1 each time a comment is liked, alongside inserting into comment_likes_tbl?
Would that help in any way?
Thanks in advance.
Yes, you should have one more field, number_of_likes, in the comment_tbl table. It will avoid unnecessary joining of tables.
This way you don't need a join until you need to know who liked the comment.
A good example is the database design of Stack Overflow itself: the Users table has a Reputation field stored in the Users table itself, instead of joining and recalculating a user's reputation every time it is needed.
You can take a few different approaches to something like this
As you're doing at the moment, run a JOIN query to return the collated results of comments and how many "likes" each has
As time goes on, you may find this is a drain on performance. Instead you could simply have a counter attached to each comment that increments. But you may find it useful to also keep your comment_likes_tbl table, as this will be a permanent record of who liked what, and when (otherwise, you would just have a single figure with no additional metadata attached)
You could potentially also have a solution where you simply store your user's likes in the comment_likes_tbl, and then a cron task will run, on a pre-determined schedule, to automatically update all "like" counts across the board. Further down the line, with a busier site, this could potentially help even out performance, even if it does mean that "like" counts lag behind the real count slightly.
(on top of these, you can also implement caching solutions etc. to store temporary records of like values attached to comments, also MySQL has useful caching technology you can make use of)
But what you're doing just now is absolutely fine, although you should still make sure you've set up your indexes correctly, otherwise you will notice performance degradation more quickly. (a non-unique index on comment_id should suffice)
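A small sketch of that combination, using the question's table names; the likes_count column and the index name are additions you would choose yourself, and comment_likes_tbl.id is assumed to be AUTO_INCREMENT:
ALTER TABLE comment_tbl ADD COLUMN likes_count INT UNSIGNED NOT NULL DEFAULT 0;
ALTER TABLE comment_likes_tbl ADD INDEX idx_comment (comment_id);

-- Record the "who liked what" history and bump the counter together.
START TRANSACTION;
INSERT INTO comment_likes_tbl (comment_id, user_id) VALUES (2, 9);
UPDATE comment_tbl SET likes_count = likes_count + 1 WHERE id = 2;
COMMIT;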
Use the query - as they are foreign keys the columns will be indexed and the query will be quick.
Yes, your architecture is good as it is and I would stick to it, for the moment.
Running too many joins can be a problem for performance, but as long as you don't actually face such problems, you shouldn't worry about it.
Even if you do run into performance problems, you should first...
check that you use (foreign) keys, so that MySQL can look up the data very fast
take advantage of the MySQL query cache
use some sort of second caching layer, like memcached, to store the number of likes (as this is only an incremental value).
Using memcached would solve the problem of running too many joins and avoid creating a column that is not really necessary.

How to store the specific (polls eg.) data in a MySQL database?

Let's say I would like to store votes to polls in mysql database.
As far as I know I have two options:
1. Create one table (let's say votes) with fields like poll_id, user_id, selected_option_id, vote_date and so on..
2. Create a new database for votes (let's say votes_base) and, for each poll, add a table to this database (a table whose name contains the id of the poll), let's say poll[id of the poll].
The problem with the first option is that the table will become big very soon. Let's say I have 1000 polls and each poll has 1000 votes - that's already a million records in the table. I don't know how much that will cost in performance.
The problem with the second option is that I'm not sure it is the correct solution from a design best-practices point of view. But I'm sure that with this option it will be (much?) faster to find all the votes for a given poll.
Or maybe there is a better option?
Your first option is the better option. It is structurally more sound. Millions of rows in a table is no problem for MySQL. A new table per poll is an antipattern.
EDIT for first comment:
Even a billion or more votes, MySQL should handle; indexes are the key here. What is the difference between one database with 100 copies of the same table and one table with 100 times the rows?
Technically, the second option works as well. Sometimes it might be even better. But we frequently see this:
Instead of one table, users, with 10 columns
Make 100 tables, users_uk, users_us, ... depending on where the users are from.
Great, no? Works, yes? Well it does, until you want to select all the male users, or join the users table onto another table. You'll have a huge UNION coming, and you won't even know the tables beforehand.
One big users table, with the appropriate indexes, is better. If it gets too big for your liking (or your disk), you can start with PARTITIONING: you still have the benefit of one table, but the partitions are stored in different locations.
Now, with your polls, these kind of queries might not happen. In that case, one big InnoDB table or 1000s of small tables might both work.. but the first option is a lot easier to program, and has no drawbacks over the second option. Why choose the second option?
The first option is the better one, no doubt. Just be sure to define INDEXes on the fields you will use to search the data (such as poll_id, for sure) and you will not experience performance issues. MySQL is a DBMS perfectly capable of handling that number of rows. Do not worry.
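A minimal version of option 1 with such an index in place; the column names are taken from the question, while the composite primary key is an extra assumption that also stops a user voting twice in the same poll:
CREATE TABLE votes (
    poll_id            INT UNSIGNED NOT NULL,
    user_id            INT UNSIGNED NOT NULL,
    selected_option_id INT UNSIGNED NOT NULL,
    vote_date          DATETIME NOT NULL,
    PRIMARY KEY (poll_id, user_id)    -- searches by poll_id use the leading column of this index
) ENGINE=InnoDB;

-- Counting the votes for one poll touches only that poll's index range.
SELECT selected_option_id, COUNT(*) AS votes
FROM votes
WHERE poll_id = 123
GROUP BY selected_option_id;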
The first option is better. And you can archive old rows after a while, if you are not going to use them often.