Basic database design and complexity - MySQL

I am designing a system which has a database for storing users and information related to the users. More specifically each user in the table has very little information. Something like Name, Password, uid.
Then each user has zero or more containers, and the way I've initially done this is to create a second table in the database which holds containers and have a field referencing the user owning it. So something like containerName, content, owner.
So a query on data from a container would look something like:
SELECT content
FROM containers
WHERE (containerName='someContainer' AND owner='someOwner');
My question is whether this is a good approach. Thinking about scalability, say we have thousands of users with, say, 5 containers each (each user could have a different number of containers, but 5 would probably be typical). My concern is that searching the database will become slow when the 5 entries I actually want in a query sit among 5*1000 entries. We may typically only want one specific container's content, so the query is looking at the table with an overhead of basically 4995 other rows, am I right? And what happens if I sign up a million users? It would become a huge table, which just intuitively feels like a bad idea.
A second take I had on it would be to have one table per user, but that doesn't feel like a very good solution either, since it would give me 1000 tables in the database, which (also by intuition) seems like a bad way to do it.
Any help in understanding how to design this would be greatly appreciated, I hope it's all clear and easy to follow.

The accepted way of handling this is to create an INDEX on the owner field. That way, MySQL can optimize queries with an owner = 'some value' condition.
See also: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
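An index on owner alone already helps; a composite index covering both columns in the WHERE clause lets MySQL jump straight to the matching rows. A minimal sketch, assuming the column names from the question (types and everything else are illustrative):

CREATE TABLE containers (
    containerName VARCHAR(64) NOT NULL,
    content       TEXT,
    owner         VARCHAR(64) NOT NULL,
    INDEX idx_owner_container (owner, containerName)  -- covers the WHERE clause below
);

SELECT content
FROM containers
WHERE owner = 'someOwner' AND containerName = 'someContainer';

You can check that the index is actually used with EXPLAIN; the type column should show ref rather than ALL (a full table scan).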
You're right in saying that 1000 tables is not scalable. Once you start reaching a few million records you might want to consider sharding (splitting records across several locations based on user attributes)... but by that time you'd already be quite successful, I think ;-)

If it is an RDBMS (like Oracle or MySQL), you can create indexes on columns that are frequently queried to speed up table traversal and querying. Indexes are automatically created for PRIMARY keys and (optionally) for FOREIGN keys.

Related

Database: a table for each user, or one big table?

I am just starting to learn about databases. While designing one, I noticed that a lot of recommendations, such as in this thread, suggest NOT using one table per user, but keeping all the data in one big table and querying it when needed. But I still do NOT understand this, because in a lot of situations one table per user seems much more efficient.
Suppose I have a database for 10,000 customers so they can track their orders. Each customer will have very few orders, around 10. This way, every time a customer logs in, you have to go through one big table to fetch that customer's data, whereas if you keep one table per user, you can directly get what the customer needs.
Another example: a restaurant information system tracks every restaurant's menu (say, as [foodname, price] pairs). Since each restaurant has a different number of dishes, you can't really put a whole menu in one row; you can only make a huge table of [foodname, price, restaurant] rows. But there are a lot of restaurants, so when a user needs the menu of a certain restaurant, you'll have to go through the data of all restaurants, which is obviously inefficient.
For both of these examples, I can't think of a good way to design the database without creating one table per user. So my question is this:
If we want to avoid a table-per-user design, how should we design a database for these kinds of situations?
SQL databases are designed exactly for the types of scenarios you are describing. They can handle millions or billions of rows extremely efficiently. The complications of trying to partition every customer into a separate table are vast.
The only thing you need to worry about is having indexes on your table so that you do not have to scan through a billion records to find the ones applicable to your customer.
Once the indexes are in place, all of your example scenarios become simple and efficient queries.
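As a rough sketch of the restaurant example (table and column names are made up for illustration), an index on the restaurant column turns "fetch one menu" into an index lookup instead of a scan over every restaurant's rows:

CREATE TABLE menu_items (
    restaurant_id INT          NOT NULL,
    foodname      VARCHAR(100) NOT NULL,
    price         DECIMAL(8,2) NOT NULL,
    INDEX idx_restaurant (restaurant_id)
);

-- Only the rows for this one restaurant are read, no matter how many restaurants exist
SELECT foodname, price
FROM menu_items
WHERE restaurant_id = 42;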
Databases are designed to do exactly the kinds of lookups you're describing efficiently, even if all users are in a single table. As long as you create an index by user ID (or have the user ID as part of the primary key), then the database will keep the table sorted by user ID, so it can find any particular user efficiently using binary search.
"Tables" don't mean exactly what you think they mean either. Tables are meant to be used to logically group data in ways that are useful for the programmer. In theory, any database you use could just consist of one big table, but it's generally easier to reason about a database if you know that rows of the User table look like this, while rows of the Message table (or whatever) look like that. In fact, many databases only actually have one big underlying "table" in which all the data lives. So, whether two users are in the "same table" or "different tables" often doesn't matter at all from an efficiency standpoint.
Database management software is written based on the assumption that you'll have a relatively small number of tables (dozens, maybe hundreds in extreme cases). So go with whatever your database's documentation recommends.

What Are Good Solutions for a Database Table that Gets Too Long?

I will describe a problem using a specific scenario:
Imagine that you create a website to which users can register,
and after they register, they can send Private Messages to each other.
This website enables every user to maintain his own Friends list,
and also maintain a Blocked Users list, from which he prefers not to get messages.
Now the problem:
Imagine this website getting to several million users,
and let's also assume that every user has about 10 Friends in the Friends table, and 10 Blocked Users in the Blocked Users table.
The Friends list Table, and the Blocked Users table, will become very long,
but worse than that, every time someone wants to send a message to another person "X",
we need to go over the whole Blocked Users table and look for the records that user "X" defined - the people he blocked.
This "scanning" of a long database table, each time a message is sent from one user to another, seems quite inefficient to me.
So I have 2 questions about it:
What are possible solutions for this problem?
I am not afraid of long database tables,
but I am afraid of database tables that contain data for so many users,
which means that the whole table needs to be scanned every time, just to pull out a few records from it for that specific user.
A specific solution that I have in my mind, and that I would like to ask about:
One solution that I have in mind for this problem is that every user who registers on the website will have his own "mini-database" dynamically (and programmatically) created for him.
That way the Friends table and the Blocked Users table will contain only records for him.
This makes scanning those tables very easy, because all the records are his.
Does this idea exist in Databases like MS-SQL Server, or MySQL? And If yes, is it a good solution for the described problem?
(each user will have his own small database created for him, and of course there is also the main (common) database for all other data that is not user specific)
Thank you all
I would wait on partitioning and on the mini-database idea. Is your database installed with the data, log and temp files on different RAID drives? Do you have clustered indexes on the tables and indexes on the search and join columns?
Have you tried any kind of reading Query Plans to see how and where the slowdowns are occurring? Don't just add memory or try advanced features blindly before doing the basics.
Creating separate databases will become a maintenance nightmare and it will be challenging to do the type of queries (for all users....) that you will probably like to do in the future.
Partitioning is a wonderful feature of SQL Server and while in 2014 you can have thousands of partitions you probably (unless you put each partition on a separate drive) won't see the big performance bump you are looking for.
SQL Server has very fast response times even for tables with tens of millions of rows (in your case, the user table). Don't let the main table get too wide and the response time will be extremely fast.
Right off the bat my first thought is this:
https://msdn.microsoft.com/en-us/library/ms188730.aspx
Partitioning can allow you to break it up into more manageable pieces and in a way that can be scalable. There will be some choices you have to make about how you break it up, but I believe this is the right path for you.
In regards to table scanning if you have proper indexing you should be getting seeks in your queries. You will want to look at execution plans to know for sure on this though.
As for having mini-DB for each user that is sort of what you can accomplish with partitioning.
Mini-Database for each user is a definite no-go zone.
Plus, on a side note: a separate table holding just two columns, UserID and BlockedUserID, both INT columns and with correct indexes - you cannot go wrong with this approach if you write your queries sensibly :)
Look into table partitioning; a well-normalized database with decent indexes will also help.
Also, if you can afford an Enterprise licence, table partitioning combined with the table schema described in the last point will make for a very good, query-friendly database schema.
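A minimal sketch of the two-column blocked-users table described above (names and types are assumptions, not a definitive design); the composite primary key doubles as the index the lookup needs:

CREATE TABLE BlockedUsers (
    UserID        INT NOT NULL,  -- the user who did the blocking
    BlockedUserID INT NOT NULL,  -- the user being blocked
    PRIMARY KEY (UserID, BlockedUserID)
);

-- Before delivering a message, check whether the recipient has blocked the sender.
-- @recipient and @sender are placeholders for the two user ids; this is an index seek, not a scan.
SELECT 1
FROM BlockedUsers
WHERE UserID = @recipient AND BlockedUserID = @sender;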
I did this once for a social network system. Maybe you should take a look at your normalization. At the time I had a [Relationship] table and it just had:
UserAId Int
UserBId Int
RelationshipFlag Smallint
With 1 million users, each with 10 "friends", that table had 10 million rows. Not a problem, since we put indexes on the columns and it could retrieve a list of all users B "related" to a specific user A in no time.
Take a good look at your schema and your indexes; if they are OK, your DB will not have problems handling it.
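A rough sketch of that [Relationship] table with indexes (the column names follow the answer; everything else is illustrative):

CREATE TABLE Relationship (
    UserAId          INT      NOT NULL,
    UserBId          INT      NOT NULL,
    RelationshipFlag SMALLINT NOT NULL,
    PRIMARY KEY (UserAId, UserBId),
    INDEX idx_userb (UserBId)  -- for lookups in the other direction
);

-- All users "related" to a given user A; resolved through the primary key
SELECT UserBId, RelationshipFlag
FROM Relationship
WHERE UserAId = 12345;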
Edit
I agree with #M.Ali
Mini-Database for each user is a definite no-go zone.
IMHO you are fine if you stick with the basics and implement them the right way.

Is it good or bad practice to have multiple foreign keys in a single table, when the other tables can be connected using joins?

Let's say I wanted to make a database that could be used to keep track of bank accounts and transactions for a user. A database that can be used in a Checkbook application.
If i have a user table, with the following properties:
user_id
email
password
And then I create an account table, which can be linked to a certain user:
account_id
account_description
account_balance
user_id
And to go the next step, I create a transaction table:
transaction_id
transaction_description
is_withdrawal
account_id // The account to which this transaction belongs
user_id // The user to which this transaction belongs
Is having the user_id in the transaction table a good option? It would make the query cleaner if I wanted to get all the transactions for each user, such as:
SELECT * FROM transactions
JOIN users ON users.user_id = transactions.user_id
Or, I could just trace back to the users table from the account table
SELECT * FROM transactions
JOIN accounts ON accounts.account_id = transactions.account_id
JOIN users ON users.user_id = accounts.user_id
I know the first query is much cleaner, but is that the best way to go?
My concern is that by having this extra (redundant) column in the transaction table, I'm wasting space, when I can achieve the same result without said column.
Let's look at it from a different angle. From where will the query or series of queries start? If you have customer info, you can get account info and then transaction info or just transactions-per-customer. You need all three tables for meaningful information. If you have account info, you can get transaction info and a pointer to customer. But to get any customer info, you need to go to the customer table so you still need all three tables. If you have transaction info, you could get account info but that is meaningless without customer info or you could get customer info without account info but transactions-per-customer is useless noise without account data.
Either way you slice it, the information you need for any conceivable use is split up between three tables and you will have to access all three to get meaningful information instead of just a data dump.
Having the customer FK in the transaction table may provide you with a way to make a "clean" query, but the result of that query is of doubtful usefulness. So you've really gained nothing. I've worked writing Anti-Money Laundering (AML) scanners for an international credit card company, so I'm not being hypothetical. You're always going to need all three tables anyway.
Btw, the fact that there are FKs in the first place tells me the question concerns an OLTP environment. An OLAP environment (data warehouse) doesn't need FKs or any other data integrity checks, as warehouse data is static. The data originates from an OLTP environment where the data integrity checks have already been made. So there you can denormalize to your heart's content. So let's not give answers applicable to an OLAP environment to a question concerning an OLTP environment.
You should not use two foreign keys in the same table. This is not a good database design.
A user makes transactions through an account. That is how it is logically done; therefore, this is how the DB should be designed.
Using joins is how this should be done. You should not use the user_id key as it is already in the account table.
The wasted space is unnecessary and is a bad database design.
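For reference, a sketch of what the normalized version could look like, with the user reachable only through the account (the column list follows the question; types and constraints are assumptions):

CREATE TABLE users (
    user_id  INT PRIMARY KEY,
    email    VARCHAR(255) NOT NULL,
    password VARCHAR(255) NOT NULL
);

CREATE TABLE accounts (
    account_id          INT PRIMARY KEY,
    account_description VARCHAR(255),
    account_balance     DECIMAL(12,2) NOT NULL,
    user_id             INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

CREATE TABLE transactions (
    transaction_id          INT PRIMARY KEY,
    transaction_description VARCHAR(255),
    is_withdrawal           BOOLEAN NOT NULL,
    account_id              INT NOT NULL,  -- no user_id here; reach the user through accounts
    FOREIGN KEY (account_id) REFERENCES accounts (account_id)
);

With this schema, the two-join query from the question is the intended way to list a user's transactions.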
In my opinion, if you have a simple many-to-many relation, just use a composite primary key made of the two foreign keys, and that's all.
Otherwise, if you have a many-to-many relation with extra columns, use one surrogate primary key and two foreign keys. It's easier to manage such a table as a single entity, which is how Doctrine does it. Generally speaking, simple many-to-many relations are rare, and they are useful just for linking two tables.
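A sketch of the simple case (a hypothetical link table, reusing the users and accounts tables sketched above, for the situation where one account can belong to several users):

CREATE TABLE users_accounts (
    user_id    INT NOT NULL,
    account_id INT NOT NULL,
    PRIMARY KEY (user_id, account_id),  -- composite key, no surrogate id needed
    FOREIGN KEY (user_id)    REFERENCES users (user_id),
    FOREIGN KEY (account_id) REFERENCES accounts (account_id)
);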
Denormalizing is usually a bad idea. In the first place, it is often not faster from a performance standpoint. What it does is put data integrity at risk, and it can create massive problems if you end up changing from a 1-1 relationship to a 1-many.
For instance, what says that each account will have only one user? In your table design that is all you would get, which is something I find suspicious right off the bat. Accounts in my system can have thousands of users. So that is the first place I question your model. Did you actually think in terms of whether the relationships would be 1-1 or 1-many? Or did you just make an assumption? Data models are NOT easy to adjust after you have millions of records; you need to do far more planning for the future in database design, and far more thinking about the data needs over time, than you do in application design.
But suppose you have a one-to-one relationship now, and three months after you go live you get a new account where they need to have 3 users. Now you have to remember all the places you denormalized in order to properly fix the data. This can create much confusion, as inevitably you will forget some of them.
Further, even if you never need to move to a more robust model, how are you going to maintain this when the user_id changes, as they often do? Now, in order to keep data integrity, you need a trigger to maintain the data as it changes. Worse, if the data can be changed from either table, you could get conflicting changes. How do you handle those?
So you have created a maintenance mess and possibly risked your data integrity, all to write "cleaner" code and save yourself all of ten seconds writing a join? You gain nothing in terms of the things that are important in database development, such as performance, security or data integrity, and you risk a lot. How short-sighted is that?
You need to stop thinking in terms of "cleaner code" when developing for databases. Often the best code for a query is the most complex-looking, because it is the most performant, and that is critical for databases. Don't project object-oriented coding techniques onto database development; they are two very different things with very different needs. You need to start thinking in terms of how this will play out as the data changes, which you clearly are not doing, or you would not even consider such a thing. You need to think more about the meaning of the data and less about the "principles of software development", which are taught as if they apply to everything but in reality do not apply well to databases.
It depends. If you can get the data fast enough, use the normalized version (where user_id is NOT in the transaction table). If you are worried about performance, go ahead and include user_id. It will use up more space in the database by storing redundant information, but you will be able to return the data faster.
EDIT
There are several factors to consider when deciding whether or not to denormalize a data structure. Each situation needs to be considered uniquely; no answer is sufficient without looking at the specific situation (hence the "It depends" that begins this answer). For the simple case above, denormalization would probably not be an optimal solution.

Storing duplicate data in MySQL tables

I have a table with all registered members, with columns like uid, username, last_action_time.
I also have a table that keeps track of who has been online in the past 5 minutes. It is populated by a cronjob by pulling data from members with last_action_time being less than 5 minutes ago.
Question: Should my online table include username or no? I'm asking this because I could JOIN both tables to obtain this data, but I could store the username in the online table and not have to join. My concern is that I will have duplicate data stored in two tables, and that seems wrong.
If you haven't run into performance issues, DO NOT denormalize. There is a good saying: "normalize until it hurts, denormalize until it works". In your case, it works with the normalized schema (users table joined). And databases are designed to handle huge amounts of data.
This approach is called denormalization. Sometimes, for a quick select query, we have to duplicate some data across tables. In this case I believe it is a good choice if you have a lot of data in both tables.
You just hit a very valid question: when does it make sense to duplicate data ?
I could rewrite your question as: when does it make sense to use a cache. Caches need maintenance, you need to keep them up to date yourself and they use up some extra space (although negligible in this case). But they have a pro: performance increase.
In the example you mentioned, you need to see if that performance increase is actually worth it and if it outweighs the additional work of having and maintaining a cache.
My gut feeling is that your database isn't gigantic, so joining every time should take a minimal amount of effort from the server, so I'd go with that.
Hope it helps
You shouldn't store the username in the online table. There shouldn't be any performance issue; just use a join every time to get the username.
Plus, you don't need the online table at all - why not query only the users with a last_action_time less than 5 minutes ago directly from the members table?
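That query would look roughly like this (a sketch; it assumes the members table name and a DATETIME last_action_time column, as described in the question):

-- Everyone active in the last 5 minutes, straight from the members table
SELECT uid, username
FROM members
WHERE last_action_time >= NOW() - INTERVAL 5 MINUTE;

-- An index keeps this a small range scan rather than a full table scan
CREATE INDEX idx_last_action ON members (last_action_time);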
A user ID would be an integer (i.e. 4 bytes). A username, I would imagine, is up to 16 bytes. How many users? How often does a username change? These are the questions to consider.
I would just store the username. I would have thought that once the username is registered it is fixed for the duration.
It is difficult to answer these questions without a little background - performance issues are difficult to think about when the depth and breadth, usage etc. are not known.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked by UserIds and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to be multiple values in the future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help make some queries (the ones with lots of JOINs) run faster, at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and look only at what is a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them, just to store the three different interests. Definitely go for the multiple "normalized" table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize, you simply won't have a row in the interests table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: a users table with the basic user information that everyone has, and then a "user_meta" table that can hold basically any key/value pair associated with the user id. If you need all the meta information for a user, you can just add that to your query, and you don't have to add the extra query when it isn't needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, and you will limit it to only one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, so your project is more scalable and will not require a complete database overhaul if you need to add a new feature.
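A rough sketch of that key/value layout (names are illustrative, loosely modeled on WordPress's usermeta table rather than copied from it):

CREATE TABLE user_meta (
    meta_id    INT AUTO_INCREMENT PRIMARY KEY,
    user_id    INT         NOT NULL,
    meta_key   VARCHAR(64) NOT NULL,
    meta_value TEXT,
    INDEX idx_user_key (user_id, meta_key)
);

-- Optional attributes live as rows instead of mostly-NULL columns
INSERT INTO user_meta (user_id, meta_key, meta_value)
VALUES (42, 'twitter_handle', '@example'),
       (42, 'interest', 'databases'),
       (42, 'interest', 'cycling');

-- Pull everything known about a user only when you actually need it
SELECT meta_key, meta_value
FROM user_meta
WHERE user_id = 42;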
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, probably more difficult to maintain.
To me it looks like you could skip User_Details, as it probably has a 1-to-1 relation with Users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
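Something along these lines, as a sketch with made-up names:

-- Hot counters split out of the wide, busy users table
CREATE TABLE user_counters (
    user_id INT PRIMARY KEY,
    likes   INT NOT NULL DEFAULT 0,
    views   INT NOT NULL DEFAULT 0
);

-- The frequent increment only touches the narrow counter row
UPDATE user_counters SET views = views + 1 WHERE user_id = 42;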
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
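A sketch of the "1:rarely" pattern (table and column names invented for illustration):

-- Rarely-populated columns live in a side table instead of as NULLs on every row
CREATE TABLE user_extras (
    user_id   INT PRIMARY KEY,
    biography TEXT,
    referrals INT
);

-- LEFT JOIN only when those columns are needed; COALESCE turns a missing row into 0
SELECT u.username,
       COALESCE(e.referrals, 0) AS referrals
FROM users AS u
LEFT JOIN user_extras AS e ON e.user_id = u.user_id
WHERE u.user_id = 42;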
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue the filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.