MySQL multiple tables vs single table: performance

Imagine the following categories: bars, places to eat, shops, etc...
Each of these categories will have some common fields, such as
id, name, address and geolocation (Lat and Lng position).
I am unsure whether I should create one table that combines these different categories, or split it up into separate tables (one table per category).
My thought is that querying places by category and geolocation against each category's own table would be faster for both retrieval and updates, certainly as the number of places per category increases.
On that reasoning, I would go for one table per category.
BUT there is a supplementary requirement. Every place will have an owner (user), and one user could own multiple places. So this means that I would either:
Need a many-to-many table that connects the user table with the central giant table;
Need one many-to-many table for each of the categories.
A second BUT: when a user logs in, all places that this user owns (i.e. their id + name) should be returned from a single query, given that a user can own multiple places.
From this point of view, the second option seems a very bad idea, because I would need queries that scan through each of the tables.
I realize that I can use indexes to speed up long table scans, yet I am still uncertain about performance if the number of places grows dramatically; there are currently about 8 different categories.
What do you consider the best solution, based on the options I proposed (or do you see a better option I'm missing)?
I should point out that the web application will not often mix up categories, although bars will be able to create events as well.
The answer to this question is invaluable to me, because it will define the foundation for the further development of my application.
Let me know if anything in my question is not clear to you.
Thank you.

When the data is that similar, a single table is usually best. Consider also that in the future you may add a new category - it would be much easier to just update a category field to include the new category than build a new table, new queries, and modify all the existing code.
As far as speed goes, indexes will make it irrelevant. Add a category field, index it, and speed won't be an issue.
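For illustration, a minimal sketch of that single-table design (table and column names are invented here, not taken from the question):

-- One table for all categories; the category column is indexed so
-- lookups by category stay fast as the table grows.
CREATE TABLE place (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    owner_id INT UNSIGNED NOT NULL,       -- the owning user
    category VARCHAR(20)  NOT NULL,       -- 'bar', 'restaurant', 'shop', ...
    name     VARCHAR(100) NOT NULL,
    address  VARCHAR(200) NOT NULL,
    lat      DECIMAL(9,6) NOT NULL,
    lng      DECIMAL(9,6) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_category (category),
    KEY idx_owner (owner_id)
);

-- "All places this user owns" then becomes a single indexed query:
SELECT id, name FROM place WHERE owner_id = 42;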

Related

Database: a table for each user, or one big table?

I'm just starting to learn databases. When designing a database, I've noticed that many recommendations, such as in this thread, suggest NOT using one table per user, but keeping all data in one big table and querying it when needed. But I still do NOT understand, because in many situations one table per user seems much more efficient.
Suppose I have a database for 10,000 customers to track their orders. Each customer will have very few orders, around 10. This way, every time a customer logs in, you have to go through a big table to fetch that customer's data; whereas if you keep one table per user, you can directly get what the customer needs.
Another example: a restaurant information system tracks all restaurants' menus (say, as [foodname, price] pairs). Since each restaurant has a different number of dishes, you can't really put a whole menu in one row; you can only make a huge table of [foodname, price, restaurant] rows. But there are a lot of restaurants, so when a user needs the menu of a certain restaurant, you'll have to go through the data of all restaurants, which seems obviously inefficient.
For both of these examples, I can't think of a good way to design the database if I don't want to create one table per user. So my question is this:
If we want to avoid the table-per-user design, how should we design a database for these kinds of situations?
SQL databases are designed exactly for the types of scenarios you are suggesting. They can handle millions or billions of rows extremely efficiently. The complications of trying to partition every customer into a separate table are vast.
The only thing you need to worry about is having indexes on your table, so that you do not have to scan through a billion records to find the ones applicable to your customer.
Once the indexes are in place then all of your example scenarios become simple and efficient queries.
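As a sketch of the order-tracking example (names invented for illustration), the index is the whole trick:

CREATE TABLE orders (
    order_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id INT UNSIGNED NOT NULL,
    placed_at   DATETIME     NOT NULL,
    PRIMARY KEY (order_id),
    KEY idx_customer (customer_id)  -- makes per-customer lookups cheap
);

-- Fetches one customer's ~10 orders without touching the other
-- 9,999 customers' rows:
SELECT order_id, placed_at FROM orders WHERE customer_id = 123;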
Databases are designed to do exactly the kinds of lookups you're describing efficiently, even if all users are in a single table. As long as you create an index by user ID (or have the user ID as part of the primary key), then the database will keep the table sorted by user ID, so it can find any particular user efficiently using binary search.
"Tables" don't mean exactly what you think they mean either. Tables are meant to be used to logically group data in ways that are useful for the programmer. In theory, any database you use could just consist of one big table, but it's generally easier to reason about a database if you know that rows of the User table look like this, while rows of the Message table (or whatever) look like that. In fact, many databases only actually have one big underlying "table" in which all the data lives. So, whether two users are in the "same table" or "different tables" often doesn't matter at all from an efficiency standpoint.
Database management software is written based on the assumption that you'll have a relatively small number of tables (dozens, maybe hundreds in extreme cases). So go with whatever your database's documentation recommends.

Database for 'who viewed this item also viewed..'

I want to create a 'who viewed this item also viewed' feature like Amazon or eBay. I'm deciding between MySQL and a non-relational database like MongoDB.
Edit: It seems straightforward to implement this feature in MySQL. My guess is to create a 'viewed' table in which userId, itemId, and the time of viewing are saved. So, when trying to recommend off of the current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId). Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId
Wouldn't this be too much for 100,000 users who each viewed 100 pages this month?
For a non-relational database, I don't feel it is right to have each User embed all Items or each Item embed all Users. So I'm thinking of having each User hold a list of itemIds he looked at, and each Item hold a list of userIds it was seen by. And I'm not sure what to do next. Am I on the right path here?
If not, could you suggest a good way to implement this feature in a non-relational database? And does that suggestion have a speed advantage compared to MySQL?
Initial Response
It seems to be straightforward to implement this feature in MySQL by just calling JOIN on the Item and User tables.
Yes.
But how fast or slow will the database call be that gathers the entire viewing history of 100,000 users at once?
How long is a piece of string?
That depends on the standards and quality of your Relational Database implementation. If you have ID fields on all your files, it won't have Relational integrity, power, or speed, it will have 1970's ISAM Record Filing System speeds.
On a Sybase ASE server, on a small Unix box, a SELECT of similar intent on a table (not a file) with 16 billion rows returns 100 rows in 12 milliseconds.
For a non-relational database, I don't feel it is right to have each User embed all Items or each Item embed all Users. So I'm thinking of having each User hold a list of item ids he looked at, and each Item hold a list of user ids it was seen by.
I can't answer re MangoDb.
But for a Relational Database, that is how we implement it.
with one great difference: the two lists are implemented in a single table
each row is a single fact viewed [sorry] from two sides (the fact that a User has viewed an Item is one and the same fact as an Item having been viewed by a User)
So it appears to be Relational thinking ... implemented Mango-style, which requires 100% data and table duplication. I have no idea whether that is good or bad in MongoDb, in the sense that it could well be what is required for the thing to "perform". Ugly as sin.
And I'm not sure what to do next. Am I on the right path here?
Right for Relational (as long as you use one table for the two "lists"). Ask a more specific question if you do not understand this point.
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Sorry, I can't answer that.
But it would be unlikely that a non-relational DB can store and retrieve info that is classic Relational, faster than a semi-relational Record Filing System such as MySQL. All things being equal, of course. A real SQL platform would be faster still.
Response to Comments
First you had:
So I'm thinking of having each User hold a list of item ids he looked at, and each Item hold a list of user ids it was seen by.
That is two lists. That is not good, because the second list is a 100% duplication of the first.
Now you have (edited in the Question, and in the new comments):
I didn't fully understand what you meant by 'use one table for the two lists'. My interpretation is to create a 'viewed' table in which userId, itemId, and the time of viewing are saved.
That is good, you now have one list.
Just to be clear about the database we are discussing, let me erect a model, and have you confirm it.
User Item Data Model
If you are not used to the standard Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId). Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId. Is this what you mean?
I did make a declaration and caution about the table, but I didn't give any directions regarding non-SQL coding, so no.
I would never suggest doing something in two steps, that can be done in a single step. SQL has its problems, but difficulty in obtaining information from a set of Relational tables (ie. a derived relation) using a single SELECT is definitely not one of them.
SUB is not SQL. Although I can guess at what it does, I may well be wrong, therefore I cannot comment on that code.
Against the model I have supplied, on an ISO/IEC/ANSI Standard SQL platform, I would use:
SELECT DISTINCT ItemId   -- Items viewed by ...
FROM UserItem
WHERE UserId IN (        -- IN, not =: the subquery returns many UserIds
    SELECT UserId        -- Users who viewed the current Item
    FROM UserItem
    WHERE ItemId = #CurrentItemId
    )
You will have to translate that into the non-SQL that your platform requires.
Wouldn't it be too much for 100,000 users who viewed 100 pages this month? Sorry for the long question.
I have already answered that question in my initial response. Please read again.
You are trying to solve a performance problem that you do not yet have. That is not possible, given the laws of physics, the dependencies, our inability to reverse the chronology, etc. Therefore I recommend that you cease that activity.
Meanwhile, back at the farm, the cows need to be fed. Design the database first, then code the app; then if, and only if, there are performance problems, you can address them. IT professionals can make scientific estimates, but I cannot give you a tutorial here on SO.
10,000,000 page views per month. You have not stated the number of Items, so that large figure is scary as hell. If you inform me as to how many Items; Users; average Items viewed per session; and the duration (eg. month) you wish to cover, I can give you more specific advice.
As I understand it, a User views 1 (one) Item. As an up-selling feature, you want the system to identify the list of Items that people "who viewed this item also viewed ...". That would appear to be a small fraction of 10,000,000 views. You do have an index on each table, yes? So the non-SQL program you are using will not read 10,000,000 views to find that fraction; it will navigate the index and read only the pages that contain that fraction.
Some of the non-SQLs need a second index to perform what real SQL platforms perform with one index. I have given that second index in the model.
While it was alright that a full definition was not provided for the file you described up to now, since I am providing a model, I have to provide a complete and correct one, not a partial one.
Since Users view Items more than once, I have given a table that allows that, and tracks the Number of Views and the Date Last Viewed. It is one row per User::Item, ever. If you would like a table that supports one row per User::Item view instead, please ask and I will provide it.
From where I sit, on the basis of the facts established thus far, the 10,000,000 figure is not a concern.
This probably depends more on how you implement this feature than on the type of database used.
If you just store a lot of viewing history (like, "user x looked at item y"), you'd have to look up the users who viewed an item, and then all the items those users looked at. That can all be done in a single database table. However, you may end up with very large result sets.
It may be easier to use a graph structure of "connected" items that is continually updated during runtime and then easily queried.
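One way to sketch that graph idea in plain MySQL (schema invented for illustration) is a pair table of "viewed together" counts, bumped at view time, so the recommendation query never has to walk the raw history:

-- Precomputed "also viewed" edges; item_a < item_b keeps each pair unique.
CREATE TABLE also_viewed (
    item_a INT UNSIGNED NOT NULL,
    item_b INT UNSIGNED NOT NULL,
    views  INT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (item_a, item_b),
    KEY idx_b (item_b)
);

-- Bump the edge whenever two items co-occur in a user's history:
INSERT INTO also_viewed (item_a, item_b)
VALUES (LEAST(10, 25), GREATEST(10, 25))
ON DUPLICATE KEY UPDATE views = views + 1;

-- Top recommendations for item 10, read straight off the indexes:
SELECT IF(item_a = 10, item_b, item_a) AS item, views
FROM also_viewed
WHERE item_a = 10 OR item_b = 10
ORDER BY views DESC
LIMIT 5;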

Best table structure for users with different roles

We are working on a website which will feature about 5 different user roles, each with different properties. In the current version of the database schema we have a single users table which holds all the users, and all of their properties.
The problem is that the properties we need differ per user role. All users have the same basic properties, like a name, e-mail address and password. But on top of that, the properties differ per role. Some have social media links, others have invoice addresses, etc. In total there may be up to 60 columns (properties), of which only a portion is used by each user role.
In total we may have about 250,000 users in the table, of which the biggest portion (about 220,000) will be of a single user role (and use about 20 of the 60 columns). The other 30,000 users are divided over four other roles and use a subset of the other 40 columns.
What is the best database structure for this, both from a DB and a development perspective? My idea is to have a base users table and then extend it with tables like users_moderators, but this may lead to a lot of JOINed queries. A way to prevent this is by using VIEWs, but I've read some (outdated?) articles suggesting that VIEWs may hurt performance, like: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/.
Does the 'perfect' structure even exist? Any suggestions, or isn't this really a problem at all and should we just put all users in a single big table?
There are two different ways to go about this. One is called "Single Table Inheritance". This is basically the design you are asking for comments on. It's pretty fast because there are no joins. However, NULLs can affect throughput to a small degree, because fat rows take a little longer to bring into memory than thinner rows.
An alternative design is called "Class Table Inheritance". In this design, there is one table for the super class and one table for each subclass. Non key attributes go into the table where they pertain. Often, a design called "Shared Primary Key" can be used with this design. In shared primary key, the key ids of the subclass tables are copies of the id from the corresponding row in the superclass table.
It's a little work at insert time, but it pays for itself when you go to join data.
You should look up all three of these in SO (they have their own tags) or out on the web. You'll get more details on the design, and an indication of how well each design fits your case.
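To make that concrete, here is a hedged sketch of Class Table Inheritance with a shared primary key (table and column names are placeholders, not from the question):

-- Superclass: columns common to every user.
CREATE TABLE users (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name    VARCHAR(100) NOT NULL,
    email   VARCHAR(100) NOT NULL,
    PRIMARY KEY (user_id)
);

-- Subclass: only the columns this role actually uses. The shared
-- primary key is a copy of the id from the corresponding users row.
CREATE TABLE users_moderators (
    user_id      INT UNSIGNED NOT NULL,
    invoice_addr VARCHAR(200),
    PRIMARY KEY (user_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- The join is cheap because both sides share the same key value:
SELECT u.name, m.invoice_addr
FROM users u JOIN users_moderators m ON m.user_id = u.user_id;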
The 'perfect' structure for such cases, in my opinion, is the party-role-relationship model. Search for Len Silverston's books about data models. It looks quite complicated at the beginning, but it gives great flexibility...
The biggest question is the practicability of adopting the perfect solution. Nobody except you can answer that. Refactoring is never an easy or fast task, so if your project's lifetime is 1 year, spending 9 months paying off 'technical debt' sounds more like a waste of time/effort.
As for the performance of joins, having proper indexes usually solves potential issues. If not, you can always implement a materialized view; even though MySQL doesn't have that option out of the box, you can design it yourself and refresh it in different ways (for instance, using triggers, or launching a refresh procedure periodically/on demand).
table user
table roles
table permissions
table userRole
table userPermission
table RolesPermissions
Each role has its permissions in the RolesPermissions table.
Each user can also have a permission without the role (an extension...).
So in PHP you just have to merge the arrays of permissions from the user's roles and the extended permissions.
And in your "acl" class you check whether your user has the permission to view or process a web page or a system process.
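A rough sketch of that layout (column names are my guesses at what the answer intends):

CREATE TABLE user       (user_id INT UNSIGNED PRIMARY KEY, name  VARCHAR(100));
CREATE TABLE role       (role_id INT UNSIGNED PRIMARY KEY, label VARCHAR(50));
CREATE TABLE permission (perm_id INT UNSIGNED PRIMARY KEY, label VARCHAR(50));

-- role -> permissions, user -> roles, plus per-user extra permissions:
CREATE TABLE role_permission (role_id INT UNSIGNED, perm_id INT UNSIGNED,
                              PRIMARY KEY (role_id, perm_id));
CREATE TABLE user_role       (user_id INT UNSIGNED, role_id INT UNSIGNED,
                              PRIMARY KEY (user_id, role_id));
CREATE TABLE user_permission (user_id INT UNSIGNED, perm_id INT UNSIGNED,
                              PRIMARY KEY (user_id, perm_id));

-- Effective permissions = permissions from roles UNION direct grants:
SELECT rp.perm_id
FROM user_role ur
JOIN role_permission rp ON rp.role_id = ur.role_id
WHERE ur.user_id = 7
UNION
SELECT perm_id FROM user_permission WHERE user_id = 7;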
I don't think you need to worry about speed here so much.
It will be a one-time thing only, i.e. on user login store the ACL in the session and get it from there next time.
JOINs are not so bad. If you have your indexes and foreign keys in the right places, with the InnoDB engine it will be really fast.
I would use one table for users with a role_id, a second table with roles, a third table for resources, and one to link it all together plus an enabled flag.

Modelling a "one to one or two" relationship

I have an uncommon database design problem that I'm not sure how to handle properly. There is a table called profile storing a website's users' public profile information. However, every profile can belong to either a single person or a couple, so I need an additional child table called person to store person-specific data. Every profile entity must have at least one but no more than two person child entities.
What is the best (in terms of being "kosher" and/or performance) way to model such a relationship? Should I go with a regular one-to-many and enforce the number of children programmatically or with stored procedures? Or should I just create two foreign key fields in the parent table and allow NULL for one of them? Maybe there's another way I can't think of?
Edit: Additional info in response to Gordon's questions
A person can be related to only one profile, and there can't be a person without a profile. Perhaps the name person is confusing, as it may suggest that a person has a profile, while in fact it's the profile that has person information.
In the case of couple profiles both persons are equal. Due to the site's specifics, the limit of 2 will never change; however, it should be possible to add or remove a person (to turn a single-person profile into a couple profile and vice versa), but there can never be fewer than 1 or more than 2 persons.
The person data would never be fetched without the profile data but the profile data could sometimes be fetched without the person data.
1)
The solution with two fields:
PRO: Allows you to precisely restrict both the minimal and maximal number of people per profile.
CON: Would allow a profile-less person.
CON: Would require 2 indexes (1 on each field) to efficiently get the profile of a given person, taking additional space and potentially slowing down INSERT/UPDATE/DELETE.
2)
But if you are willing to enforce the minimal number at the application level, you might be better off with something like this:
CHECK(PERSON_NO = 1 OR PERSON_NO = 2)
Characteristics:
CON: Allows a person-less profile.
PRO: Restricts maximal number of people per profile, yet easy to change by just modifying the CHECK.
PRO: If you keep the identifying relationship as above, it doesn't require additional indexes and is clustering-friendly (persons of the same profile can be stored physically close together, minimizing I/O during JOIN).
On the other hand, if you have a key PERSON_ID (or similar), then an additional index on {PROFILE_ID, PERSON_NO} would be necessary for the efficient enforcement of key constraint on these fields too.
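For reference, option 2) might look like this as a sketch (note that MySQL only enforces CHECK constraints from version 8.0.16; on older servers the cap would need a trigger or an application-level check):

CREATE TABLE profile (
    profile_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (profile_id)
);

-- Identifying relationship: person is keyed by (profile_id, person_no),
-- so at most persons 1 and 2 can exist per profile, and both rows of a
-- profile cluster together physically.
CREATE TABLE person (
    profile_id INT UNSIGNED NOT NULL,
    person_no  TINYINT      NOT NULL,
    name       VARCHAR(100) NOT NULL,
    PRIMARY KEY (profile_id, person_no),
    FOREIGN KEY (profile_id) REFERENCES profile (profile_id),
    CHECK (person_no = 1 OR person_no = 2)
);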
3)
Theoretically, you could even combine the two approaches and avoid both profile-less persons and person-less profiles:
(PERSON1_ID is not NULL-able, PERSON2_ID is NULL-able)
However, this leads to circular references, requiring deferred constraints to resolve, which are unfortunately not supported by MySQL.
4)
And finally, you could just take a brute-force approach and simply place fields of both persons in the profile table (and make one of these sets not NULL-able and the other NULL-able).
Out of all these possibilities, I'd probably go with 2).
You essentially have two options, as you mention in your question. You can store two fields in the table. Or, you can have a second table that has the mapping information.
Here are some additional questions to help you answer the question:
Can a person have their own profile and a profile as part of a couple?
Are both people on a profile "equal" or is one the "master" and the other an "alternate"?
When you fetch profile information, will you always be including information about all people on the profile?
Can you have persons without profiles?
In this case, I just have the sneaking suspicion that the limit of "2" may change in the future. This suggests storing the mapping in a separate table, since increasing "2" by adding a field is a problem in terms of modifying existing code. In other words, create a separate table, person_profile, that maps persons to profiles. In MySQL, you can always gather the person-level information using GROUP_CONCAT().
One case where it is better to put such similar fields in the same table is when one is clearly preferred and the other is the alternate. In that case, you end up doing a lot of "COALESCE(preferred, alternate)" type logic.
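For example (hypothetical person_profile table, names invented), GROUP_CONCAT() collapses the persons back onto one row per profile:

CREATE TABLE person_profile (
    person_id   INT UNSIGNED NOT NULL,
    profile_id  INT UNSIGNED NOT NULL,
    person_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (person_id),
    KEY idx_profile (profile_id)
);

-- One row per profile, with all of its people flattened into a list:
SELECT profile_id,
       GROUP_CONCAT(person_name ORDER BY person_id) AS persons
FROM person_profile
GROUP BY profile_id;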

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked by UserIds and output via sometimes complicated queries to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship, which was the main reason for denormalising them.
Are there going to be issues if the table spans 100+ columns when a large portion of those cells is likely to remain empty?
Multiple tables help in the following ways / cases:
(a) If different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and granting authorization on them appropriately.)
(c) For moving data to different places, especially during development, it may make sense to use multiple tables, resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought was single-value data may turn out to really be multiple values in the future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation, which may take dozens of times as long as pure sequential record scanning.
Moving all your records into one table will help you get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in the table scan can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the user's interests are represented as several records in the user_interests table, then merging those tables would be out of the question immediately.
Regarding the previous answers about normalization, it must be said that the database normalization rules completely disregard performance and look only at what is a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them, just to store the three different interests. Definitely go for the multiple "normalized" table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now, since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins or calculations, which reports could point to. These were then used in conjunction with a SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: a users table with the basic user information that everyone has, plus a "user_meta" table that can hold basically any key/value pair associated with the user id. So if you need all the meta information for a user, you can just add that to your query. You also wouldn't always have to run the extra query when it's not needed, for things like logging in. This approach also leaves your schema open to adding new features for your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all the metadata, and you limit it to only one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable and means it won't require a complete database overhaul if you need to add a new feature.
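A hedged sketch of that key/value pattern (WordPress's real table is wp_usermeta; the names here are simplified):

CREATE TABLE user_meta (
    user_id    INT UNSIGNED NOT NULL,
    meta_key   VARCHAR(64)  NOT NULL,
    meta_value TEXT,
    PRIMARY KEY (user_id, meta_key)
);

-- New features are just new keys; no ALTER TABLE needed:
INSERT INTO user_meta VALUES (42, 'twitter_handle', '@example');

-- All of one user's metadata in a single extra query:
SELECT meta_key, meta_value FROM user_meta WHERE user_id = 42;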
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, it is probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1:1 relation with users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
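A sketch of that split (names invented):

-- Hot counters live apart from the wide, busy item table.
CREATE TABLE item_counts (
    item_id INT UNSIGNED NOT NULL,
    views   INT UNSIGNED NOT NULL DEFAULT 0,
    likes   INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (item_id)
);

-- The very frequent +1 touches only this narrow row, not the main table:
UPDATE item_counts SET views = views + 1 WHERE item_id = 99;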
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on the columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
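For instance (sketch, names invented), a composite index can only serve a WHERE clause when both filtered columns live in the same table:

CREATE TABLE event (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id INT UNSIGNED NOT NULL,
    kind    VARCHAR(20)  NOT NULL,
    PRIMARY KEY (id),
    KEY idx_user_kind (user_id, kind)
);

-- Served entirely by idx_user_kind:
SELECT id FROM event WHERE user_id = 42 AND kind = 'click';
-- Had kind been moved to a separate 1:1 table, this filter would need a
-- join, and no single index could cover both conditions.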
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve one or more extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.