Divide SQL data into different tables - MySQL

I want to split users' data into different tables so that there isn't a huge one containing all the data...
The problem is that in tables other than the main one I can't tell which user each row belongs to.
Should I store the same user id in every table during signup? Doesn't that create unnecessary duplicates?
EDIT:
example
table:
| id | user | email | phone number | password | followers | following | likes | posts |
becomes
table 1:
| id | user | email | phone number | password |
table 2:
| id | followers num | following num | likes num | posts num |

This looks like an "XY problem".
You want to "not have a huge table". But why is it that you have this requirement?
Probably it's because some responses in some scenarios are slower than you expect.
Rather than split tables every which way - which, as Gordon Linoff mentioned, is a SQL antipattern and liable to leave you more in the lurch than before - you should monitor your system and measure the performance of the various queries you use, weighting them by frequency. That is, if query #1 is run one hundred thousand times per period and takes 0.2 seconds, that's 20,000 seconds you should chalk up to query #1. Query #2, which takes fifty times longer - ten full seconds - but is only run one hundred times, accrues only one twentieth of the first query's total.
(Since long delays are more noticeable to end users, some use a variation of this formula in which you multiply the number of executions of a query by the square - or a higher power - of its duration in milliseconds. This way, slower queries attract proportionally more attention.)
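If you're on a reasonably recent MySQL, performance_schema already collects these numbers for you. A query along these lines (a sketch, assuming performance_schema is enabled, which it is by default) ranks statement digests by total accumulated time, i.e. frequency times duration:

-- Timer columns are in picoseconds, hence the division by 1e12.
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_seconds,
       AVG_TIMER_WAIT / 1e12 AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;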
Be that as it may, once you know which queries to optimize first, you can start optimizing your schema.
The first things to check are indexes. And maybe normalization. Those cover a good two thirds of the "low performing" cases I have met so far.
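For instance (a minimal sketch; the table and column names are taken from the example above and are illustrative), EXPLAIN tells you whether a slow query uses an index at all, and adding one is a single statement:

-- Does this lookup use an index?
EXPLAIN SELECT id, `user` FROM users WHERE email = 'someone@example.com';

-- If the plan shows type: ALL (a full table scan), an index usually helps:
ALTER TABLE users ADD INDEX idx_users_email (email);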
Then there's segmentation. Maybe not in your case, but you might have a table of transactions or such where you're usually only interested in the current calendar or fiscal year. Adding a column with that information will make the table larger, but selecting only those records that at minimum match a condition on the year will make most queries run much faster. This is also supported at a lower level (see "Sharding").
Then there are careless JOINs and sub-SELECTs. Usually they start small and fast, so no one bothers to check indexes, normalization or conditions on them. After a couple of years, the inner SELECT is gathering one million records, and the outer JOIN discards nine hundred and ninety-nine thousand of them. Move the discarding condition inside the subselect and watch the query take off.
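As an illustration (hypothetical tables and columns), moving the filter from the outer query into the derived table means the aggregation only touches the rows you actually keep:

-- Before: aggregate every year, then throw away all but one.
SELECT u.name, o.total
FROM users AS u
JOIN (
    SELECT user_id, order_year, SUM(amount) AS total
    FROM orders
    GROUP BY user_id, order_year
) AS o ON o.user_id = u.id
WHERE o.order_year = 2024;

-- After: same result, far fewer rows aggregated.
SELECT u.name, o.total
FROM users AS u
JOIN (
    SELECT user_id, SUM(amount) AS total
    FROM orders
    WHERE order_year = 2024
    GROUP BY user_id
) AS o ON o.user_id = u.id;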
Then you can check whether some information is really rarely accessed (for example, I have one DB where each user has a bunch of financial information, but this is only needed in maybe 0.1% of requests. So in that case yes, I have split that information into a secondary table, also gaining the possibility of supporting users with multiple bank accounts enrolled in the system. That was not why I did it, mind you).
In all this, also take into account time and money. Doing the analysis, running the modifications and checking them out, plus any downtime, is going to cost something and possibly even increase maintenance costs. Maybe - just maybe - throwing less money than that into a faster disk or more RAM or more or faster CPUs might achieve the same improvements without any need to alter either the schema or your code base.

I think you want to use a LEFT JOIN
SELECT t1.`user`, t2.`posts`
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t1.id = t2.id
EDIT: Here is a link to documentation that explains different types of JOINS

I believe I understand your question: yes, you can use a foreign key. When you have a list of users, make sure that each user has a unique id.
Later, when you insert data about a user into a different table, you can include that user's id, taken from a session variable or a GET request.
Then, when you need to pull data for that specific user from those other tables, you can just SELECT FROM the table WHERE the id column = session[id] or get[id].
Does that help?
Answer: use a foreign key to identify users' data, using GETs and sessions.
Don't worry about duplicates if you are removing those values from the main table.
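A minimal sketch of that idea, with hypothetical table and column names (the application substitutes the id from the session or GET request for the placeholders):

CREATE TABLE users (
    id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE user_profiles (
    user_id      INT UNSIGNED NOT NULL,
    phone_number VARCHAR(32),
    FOREIGN KEY (user_id) REFERENCES users(id)
) ENGINE=InnoDB;

-- Insert data for the logged-in user, then pull it back later:
INSERT INTO user_profiles (user_id, phone_number) VALUES (?, ?);
SELECT phone_number FROM user_profiles WHERE user_id = ?;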

One table would probably have an AUTO_INCREMENT for the PRIMARY KEY; the other table would have the identical PK, but it would not be AUTO_INCREMENT. JOINing the tables will put the tables "back together" for querying.
There is rarely a good reason to "vertically partition" a table. One rare case is to split out the "like_count" or "view_count". This way the main table would not be bothered by the incessant UPDATEing of the counters. In some extreme cases, this may help performance.
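A sketch of what that looks like, using hypothetical names based on the question's example (the counters table reuses the same id value and has no AUTO_INCREMENT of its own):

CREATE TABLE users (
    id            INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    `user`        VARCHAR(64),
    email         VARCHAR(255),
    password_hash VARBINARY(255)
) ENGINE=InnoDB;

CREATE TABLE user_counters (
    id            INT UNSIGNED PRIMARY KEY,  -- same value as users.id, not AUTO_INCREMENT
    followers_num INT UNSIGNED NOT NULL DEFAULT 0,
    following_num INT UNSIGNED NOT NULL DEFAULT 0,
    likes_num     INT UNSIGNED NOT NULL DEFAULT 0,
    posts_num     INT UNSIGNED NOT NULL DEFAULT 0,
    FOREIGN KEY (id) REFERENCES users(id)
) ENGINE=InnoDB;

-- The JOIN puts them "back together":
SELECT u.`user`, c.followers_num
FROM users AS u
JOIN user_counters AS c USING (id);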

Related

Adding extra fields to prevent needing joins

In consideration of schema design, is it appropriate to add extra table fields I wouldn't otherwise need in order to prevent having to do a join? Example:
products_table
| id | name | seller_id |
users_table
| id | username |
reviews_table
| id | product_id | seller_id |
For the reviews table, I could leave seller_id out and use a join on the products table to get the seller's user id, or include it and avoid the join. There are often tables where several joins are needed to get at some information, where I could just have my app add redundant data to the table instead. Which is more correct in terms of schema design?
You seem overly concerned about the performance of JOINs. With proper indexing, performance is not usually an issue. In fact, there are situations where JOINs are faster -- because the data is more compact in two tables than storing the fields over and over and over again (this applies more to strings than to integers, though).
If you are going to have multiple tables, then use JOINs to access the "lookup" information. There may be some situations where you want to denormalize the information. But in general, you don't. And premature optimization is the root of a lot of bad design.
Suppose you add a column reviews.seller_id and you populate it with values, and then some weeks later you find that the values aren't always the same as the seller in the products_table.
In other words, the following query should always return a count of 0, but what if one day it returns a count of 6?
SELECT COUNT(*)
FROM products_table AS p
JOIN reviews_table AS r ON r.product_id = p.id
WHERE p.seller_id <> r.seller_id
Meaning there was some update of one table, but not the other. They weren't both updated to keep the seller_id in sync.
How did this happen? Which table was updated, and which one still has the original seller_id? Which one is correct? Was the update intentional?
You start researching each of the 6 cases, verify who is the correct seller, and update the data to make them match.
Then the next week, the count of mismatched sellers is 1477. You must have a bug in your code somewhere that allows an update to one table without updating the other to match. Now you have a much larger data cleanup project, and a bug-hunt to go find out how this could happen.
And how many other times have you done the same thing for other columns -- copied them into a related table to avoid a join? Are those creating mismatched data too? How would you check them all? Do you need to check them every night? Can they be corrected?
This is the kind of trouble you get into when you use denormalization, in other words storing columns redundantly to avoid joins, avoid aggregations, or avoid expensive calculations, to speed up certain queries.
In fact, you don't avoid those operations, you just move the work of those operations to an earlier time.
It's possible to make it all work seamlessly, but it's a lot more work for the coder to develop and test the perfect code, and fix the subsequent code bugs and inevitable data cleanup chores.
This depends on each specific case. Purely in terms of schema design, you should not have any redundant columns (see database normalization). However, in a real-world scenario it sometimes makes sense to have redundant data; for example, when facing performance issues, you can sacrifice some storage to make SELECT queries faster.
Adding a redundant column today will make you curse tomorrow. If you handle keys in the database properly, performance will not penalize you.

Does MySQL table size matter when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to be coming in per 2 weeks time.
There are a few variables (each of them would need its own column) that I'll allow people to use with the click tracking - but I don't want to limit them to 5 or so of these variables. That's why I thought about creating Table B where I can store these variables for each click.
However each click might have 5-15+ of these variables depending on how many they are using. If I store them in a separate table, that will multiply the 10M/2 weeks by the number of variables each user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both writing and most importantly reading performance, is there any difference if I JOIN a 100M rows table to a:
500 rows table OR to a 100M rows table?
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
is there any difference if I JOIN a 100M rows table to a...
Yes there is. A JOIN's performance depends solely on how long it takes to find matching rows based on your ON condition. This means that increasing the number of rows in a joined table will increase the JOIN time, since there are more rows to sift through for matches. In general, a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many optimization strategies the optimizer may use to change this value, but it works as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. An index tells the database to maintain an auxiliary data structure over a column, usually a B-Tree, so that values can be looked up quickly. This makes write operations somewhat more expensive, since the index must be updated along with the data, but it reduces read time because the data is presorted in that structure, allowing for quick lookups.
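As a rough sketch of what that could look like for the click-tracking case in the question (hypothetical names; the composite primary key doubles as the index the JOIN needs):

CREATE TABLE clicks (
    click_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    clicked_at DATETIME NOT NULL
) ENGINE=InnoDB;

CREATE TABLE click_variables (
    click_id BIGINT UNSIGNED NOT NULL,
    name     VARCHAR(64) NOT NULL,
    value    VARCHAR(255),
    PRIMARY KEY (click_id, name),   -- also serves the JOIN lookup
    FOREIGN KEY (click_id) REFERENCES clicks(click_id)
) ENGINE=InnoDB;

SELECT c.click_id, v.name, v.value
FROM clicks AS c
JOIN click_variables AS v USING (click_id)
WHERE c.clicked_at >= '2024-01-01';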
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
There are a lot of factors that go into saying yes or no here: mainly, whether storage space is an issue and how likely duplicate data is to appear. If storage space is not an issue and duplicates are not likely to appear, then one large table may be the right decision. If you have limited storage space, then storing the excess NULLs may not be smart. If you have many duplicate values, then one large table may be less efficient than a JOIN.
Another factor to consider when denormalizing is whether another table would ever want to access values from just one of the previous two tables. If yes, then the JOIN needed to obtain these values after denormalizing would be less efficient than keeping the two tables separate. This is really something you need to weigh yourself when designing the database and seeing how it is used.
First: There is a huge difference between joining 10m to 500 or 10m to 10m entries!
But using a proper index and structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would totally NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20m entries in your table. So even if there are some good reasons for using denormalized tables (performance, tablespace, ...), it is a bad idea for further changes - but in the end it's your decision ;)

Lots of small MySQL tables or one big table

I have a forum where I have properties like:
follow, voteup, votedown, report, favorite, view etc. for each thread, answer and comment.
Which approach will be faster and better performance-wise?
I am expecting billions of favorites, views etc. ... just like YouTube.
Approach One
Make one big table counter
counter_id | user_id | object_id | object_type | property
where object_type = thread, comment or answer, with their respective ids from the tables threads, comments, answers
and property = follow, voteup, votedown, report, etc.
Approach Two
Make individual tables of follow,views,report etc
views
view_id | user_id | object_id | object_type
follows
follow_id | user_id | object_id | object_type
There is no single answer to this; it's quite subjective.
Most commonly it's best to consider the use cases for your design. Think carefully about what these fields will be used for before you add them to any table. And don't think that you have to add a numeric primary key ("ID") to every table. A table for tracking follows is just fine with only the fields user id | object id | object type and all three fields contained in the primary key.
It's unlikely your code will ever be used under the same performance constraints as YouTube or even Stack Overflow. If it is, you will most likely have remodelled the database by then.
However for the sake of the exercise consider where and how data is to be used...
I would have separate tables as follows
Follow
User feeds probably need their own table, as most commonly it gets hit from anywhere (a bit like a global inbox). The follow should also have some flag or timestamp to show changes, so that it's very easy to evaluate when changes have occurred since the last time the user was online...
This is because a user needs to see what they've followed as some sort of feed, and others need to see how many people have followed. But others don't need to see who else has followed.
Vote up, Vote down
That's just a vote and a +/- flag. Do denormalize this... That is, store BOTH a user's individual votes in a table AND a count of votes in a field on the object's own table. That way you only ever check a single user's vote (their own) for a page view. The counts are retrieved from the same row containing the content.
Again: a user needs to see what they've up/down voted. You need to check they're not voting twice. What matters is the final count. So checking an object with a million upvotes should not have to hit a million rows - just one.
Pro tip: Some database engines perform badly if you constantly update rows with large content. So consider a "meta-data" table for all objects. Which stores counts such as this. This leaves the meta data free to update frequently even if the content doesn't.
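A minimal sketch of that pattern (hypothetical names): record the individual vote, then bump the cached count on the object's metadata row, inside one transaction.

START TRANSACTION;

INSERT INTO votes (user_id, object_type, object_id, direction)
VALUES (42, 1, 1001, 1);          -- 1 = up, -1 = down

UPDATE object_meta
SET vote_total = vote_total + 1
WHERE object_type = 1 AND object_id = 1001;

COMMIT;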
Favorite
Own table again: user id | object id | object type. If you want to display the number of favourites to the public, then keep a count of this against the object; don't do a SELECT COUNT(*) on every page view.
View
Why even store this? Keep a count against the object. If you're going to store a history then make sure you put a timestamp against it and purge it regularly. You don't need to store what a user was looking at six months ago.
As a general observation all of these are separate tables with the exception of up and down votes.
You should denormalize the counts to reduce the quantity of data your server needs to access to render a page view. Most commonly a page view should be the fastest thing; any form of update can be a little slower.
Where I mention for favourites and others that they don't need an additional primary key field, what I mean is that they have a primary key, just not an additional surrogate field. For example favourites could be:
CREATE TABLE favourites (
    user INT,
    object_type INT,
    object_id INT,
    PRIMARY KEY (user, object_type, object_id)
)
There's simply no reason to have a favorite_id field.
Answer, Part 1: Plan on redesigning as you go.
The best advice I can give you is to plan for change. What you design for the first million will not work for 30 million. The 30-million design will not survive to a billion. Whatever you do after reading this thread may last you through 30K rows.
Why is this? Well, partially because you will not be able to do it in a single machine. Don't shard your database now, but keep in the back of your mind that you will need to shard it. At that point, much of what worked on a single machine will either not work on multiple machines, or will be too slow to work. So you will have to redesign.
Let me point out another aspect of 1 billion rows. Think how fast you have to do INSERTs to grow a table to 1B rows in 1 year. It's over 30 per second. That's not bad, until you factor in the spikes you will get.
And what will happen when your second billion won't fit on the disk you have laid out?
Anyone who grows to a billion rows has to learn as he goes. The textbooks don't go there; the manuals don't go there; only the salesmen go there, but they don't stick around after the check clears. Look at YouTube (etc) -- almost nothing is "off the shelf".
And think of how many smart designers you will need to hire to get to 1 billion.
It is painful to add a column to a billion-row table, so (1) plan ahead, and (2) design a way to make changes without major outages.
Answer, Part 2: Some tips
Here are some of my comments on the ideas bounced around, and some tips from someone who has dealt with a billion-row, sharded system (not YouTube, but something similar).
Normalize vs denormalize: My motto: "Normalize, but don't overnormalize." You'll see what I mean after you have done some of it.
One table vs many: Two tables with essentially identical CREATE TABLEs should usually be a single table. (Sharding, of course, violates that.) OTOH, if you need thousands of UPDATE...view_count = view_count + 1 per second, it won't survive to a billion. However, it might survive to a million; then plan for change.
Minimize the size of datatypes -- Using a MEDIUMINT instead of an INT for one column saves a gigabyte.
Do not paginate using OFFSET and LIMIT. (I have a blog on a workaround.)
Batch INSERTs where possible (see the sketch after these tips).
Use InnoDB, you don't want to wait hours for a REPAIR to finish on a MyISAM table.
The simple task of getting a unique ID for the 'next' item can be a huge problem in a sharded system. Wait until you are closer to needing sharding before redesigning that part. Do not use UUIDs for a billion-row table; they will perform poorly. So, don't even think about UUIDs now; you will have to throw them away.
Long before you hit 1 billion, you will have nightmares about the one machine crashing. Think about replication, HA, etc, early. It is painful to set up such after you have big tables.
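A small sketch of two of the tips above (smaller datatypes and batched INSERTs), using hypothetical names:

CREATE TABLE object_counters (
    object_id   INT UNSIGNED NOT NULL,
    object_type TINYINT UNSIGNED NOT NULL,              -- 1 byte instead of a string label
    view_count  MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,  -- 3 bytes instead of 4
    PRIMARY KEY (object_id, object_type)
) ENGINE=InnoDB;

-- One round trip instead of three:
INSERT INTO object_counters (object_id, object_type, view_count)
VALUES (1, 1, 0), (2, 1, 0), (3, 2, 0);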

MySQL performance; large data table or multiple data tables?

I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, what would be better for performance of the following:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, and then a table of all phone numbers, etc and then use a select statement that has multiple joins
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, and address (city, state, zip). Each member will have at least 1 of each of those but can have multiple (normally 1-3 per member), so in excess of 75000 phone numbers, email addresses, etc.
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database, i.e. taking a CSV file of 10k rows of sales and contact data and querying the database to try to find which member corresponds to which sales row from the CSV. Oh yeah, and this is done on a web server, not a local machine (not my choice).
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
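For example (a sketch with hypothetical names), items that occur a variable number of times, such as phone numbers, would get their own table keyed back to the member:

CREATE TABLE members (
    member_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(64),
    last_name  VARCHAR(64),
    company    VARCHAR(128)
) ENGINE=InnoDB;

CREATE TABLE member_phones (
    member_id  INT UNSIGNED NOT NULL,
    phone      VARCHAR(32) NOT NULL,
    phone_type ENUM('office','mobile','fax') NOT NULL DEFAULT 'office',
    PRIMARY KEY (member_id, phone),
    FOREIGN KEY (member_id) REFERENCES members(member_id)
) ENGINE=InnoDB;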
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. Mysql supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
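A minimal sketch of native range partitioning (hypothetical names; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):

CREATE TABLE member_activity (
    member_id   INT UNSIGNED NOT NULL,
    activity_on DATE NOT NULL,
    detail      VARCHAR(255),
    PRIMARY KEY (member_id, activity_on)
)
PARTITION BY RANGE (YEAR(activity_on)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);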
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns, or even single columns, and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs. scaling up when it comes to databases. In addition to the aforementioned research, I would recommend using one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
The point is that though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to remain truly declarative rather than programmatic as to which object you're addressing/querying. It is important to note the difference between sharding and partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither
Although a variant of the first option is the right approach.
Create a 'lookup' table that will store values of data type (mail, phone etc...). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
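A sketch of that layout, with hypothetical names (assuming a members table keyed by member_id): one lookup table of data types and one data table referencing both.

CREATE TABLE data_types (
    data_type_id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(32) NOT NULL UNIQUE   -- 'email', 'phone', ...
) ENGINE=InnoDB;

CREATE TABLE member_data (
    member_id    INT UNSIGNED NOT NULL,
    data_type_id TINYINT UNSIGNED NOT NULL,
    data         VARCHAR(255) NOT NULL,
    PRIMARY KEY (member_id, data_type_id, data),
    FOREIGN KEY (member_id)    REFERENCES members(member_id),
    FOREIGN KEY (data_type_id) REFERENCES data_types(data_type_id)
) ENGINE=InnoDB;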

Having a column 'number_of_likes' or a separate table...?

In my project, I need to calculate the 'number_of_likes' for a particular comment.
Currently I have the following structure for my comment_tbl table:
| id | user_id | comment_details |
| 1 | 10 | Test1 |
| 2 | 5 | Test2 |
| 3 | 7 | Test3 |
| 4 | 8 | Test4 |
| 5 | 3 | Test5 |
And I have another table 'comment_likes_tbl' with following structure:
| id | comment_id | user_id |
| 1 | 1 | 1 |
| 2 | 2 | 5 |
| 3 | 2 | 7 |
| 4 | 1 | 3 |
| 5 | 3 | 5 |
The above one are sample data.
Question :
On my live server there are around 50K records, and I calculate the number_of_likes for a particular comment by joining the above two tables.
I need to know: is this OK?
Or should I add one more field to the comment_tbl table to record the number_of_likes and increment it by 1 each time a comment is liked, along with inserting a row into comment_likes_tbl?
Would that help me in any way?
Thanks in advance.
Yes, you should have one more field, number_of_likes, in the comment_tbl table. It will avoid unnecessary joining of tables.
This way you don't need a join until you need to find out who liked the comment.
A good example is the database design of Stack Overflow itself: the Users table has a Reputation field in the table itself, instead of joining and calculating a user's reputation every time it is needed.
You can take a few different approaches to something like this
As you're doing at the moment, run a JOIN query to return the collated results of comments and how many "likes" each has
As time goes on, you may find this is a drain on performance. Instead you could simply have a counter attached to each comment row that increments. But you may find it useful to also keep your comment_likes_tbl table, as this will be a permanent record of who liked what, and when (otherwise, you would just have a single figure with no additional metadata attached).
You could potentially also have a solution where you simply store your user's likes in the comment_likes_tbl, and then a cron task will run, on a pre-determined schedule, to automatically update all "like" counts across the board. Further down the line, with a busier site, this could potentially help even out performance, even if it does mean that "like" counts lag behind the real count slightly.
(on top of these, you can also implement caching solutions etc. to store temporary records of like values attached to comments, also MySQL has useful caching technology you can make use of)
But what you're doing just now is absolutely fine, although you should still make sure you've set up your indexes correctly, otherwise you will notice performance degradation more quickly. (a non-unique index on comment_id should suffice)
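Written out against the question's own tables, the JOIN approach plus that index looks something like this (the GROUP BY form assumes you want the count per comment):

CREATE INDEX idx_likes_comment ON comment_likes_tbl (comment_id);

SELECT c.id, c.comment_details, COUNT(l.id) AS number_of_likes
FROM comment_tbl AS c
LEFT JOIN comment_likes_tbl AS l ON l.comment_id = c.id
GROUP BY c.id, c.comment_details;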
Use the query - as they are foreign keys the columns will be indexed and the query will be quick.
Yes, your architecture is good as it is and I would stick to it, for the moment.
Running too many joins can be a performance problem, but as long as you aren't facing such problems, you shouldn't worry about it.
Even if you do run into performance problems, you should first:
check that you use (foreign) keys, so that MySQL can look up the data very fast
take advantage of the MySQL query cache
use some sort of second caching layer, like memcached, to store the number of likes (as this is only an incremental value).
Using memcached would solve the problem of running too many joins and avoid creating a column that is not really necessary.