Inserting redundant information into the database to prevent table joins? [closed] - mysql

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm trying to build an activity stream which has the following structure:
------------------------------------------------------------------------------------
id | activity_by_user_id | activity_by_username | ... other activity related columns
------------------------------------------------------------------------------------
Is it a good approach to store activity_by_username in the activity table as well? I understand that this will clutter up the table with the same username again and again. But if I don't, I will have to join with the users table to fetch the username.
The username in my web application never changes.
With this, I will no longer have to join this table with the users table. Is this an optimal way of achieving what I need?

What you are proposing is to denormalize the data structure. There are advantages and disadvantages to this approach.
Clearly, you think that performance will be an advantage, because you will not need to look up the username on each row. This may not be true. The lookup should be on the primary key of the table and should be quite fast. There are even situations where storing the redundant information could slow down the query. This occurs when the field size is large and there are many rows for the same user. Then you are wasting lots of storage on redundant data, increasing the size of the table. Normally, though, you would expect to see a modest -- very modest -- improvement in performance.
Balanced against that is the fact that you are storing redundant data. So, if the user name were updated, then you would have to change lots of rows with the new information.
On balance, I would only advise you to go with such an approach if you tested it on real data in your environment and the performance improvement is worth it. I am skeptical that you would see much improvement, but the proof is in the pudding.
By the way, there are cases where denormalized data structures are needed to support applications. I don't think that looking up a field using a primary key is likely to be one of them.
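For reference, the normalized version of the lookup might look like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the activity_type column and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE activity (
        id                  INTEGER PRIMARY KEY,
        activity_by_user_id INTEGER NOT NULL REFERENCES users(id),
        activity_type       TEXT NOT NULL  -- hypothetical activity column
    );
    INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO activity (activity_by_user_id, activity_type) VALUES
        (1, 'posted'), (2, 'commented'), (1, 'liked');
""")

# The username is fetched through a primary-key lookup on users,
# so it is stored exactly once, in the users table.
rows = conn.execute("""
    SELECT a.id, u.username, a.activity_type
    FROM activity AS a
    JOIN users AS u ON u.id = a.activity_by_user_id
    ORDER BY a.id
""").fetchall()
print(rows)
```

Whether this join is measurably slower than reading a redundant username column is exactly what you would need to test on real data.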

There isn't a single answer to your question*
In general, relational database design seeks to avoid redundancy to limit the opportunities for data anomalies. For example, you now have the chance that two given rows might contain the same user id but different user names. Which one is correct? How do you prevent such discrepancies?
On the other hand, denormalization by storing certain columns redundantly is sometimes justified. You're right that you avoid doing a join because of that. But now it's your responsibility to make sure data anomalies don't creep in.
And was it really worth it? In MySQL, doing a join to look up a related row by its primary key is pretty efficient (you see this as a join type "eq_ref" in EXPLAIN). I wouldn't try to solve that problem until you can prove it's a bottleneck.
Basically, denormalization optimizes one type of query, at the expense of other types of queries. The extra work you do to prevent, detect, and correct data anomalies may be greater than any efficiency you gain by avoiding the join in this case. Or if usernames were to change sometimes, you'd have to change them in two places now (I know you said usernames don't change in your app).
The point is that it depends entirely on how frequently different queries are run by your application, so it's not something anyone can answer for you.
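To make the anomaly concrete, here is a rough sketch of the kind of check you would end up running against a denormalized activity table. It uses Python's sqlite3 module as a stand-in for MySQL, and the sample rows (including the deliberate 'bob'/'bobby' mismatch) are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (
        id                   INTEGER PRIMARY KEY,
        activity_by_user_id  INTEGER NOT NULL,
        activity_by_username TEXT NOT NULL
    );
    -- User 2 appears under two different names: a data anomaly.
    INSERT INTO activity (activity_by_user_id, activity_by_username) VALUES
        (1, 'alice'), (1, 'alice'), (2, 'bob'), (2, 'bobby');
""")

# Find user ids that are stored with more than one distinct username.
inconsistent = conn.execute("""
    SELECT activity_by_user_id, COUNT(DISTINCT activity_by_username) AS names
    FROM activity
    GROUP BY activity_by_user_id
    HAVING COUNT(DISTINCT activity_by_username) > 1
""").fetchall()
print(inconsistent)
```

With the normalized design this query is unnecessary, because there is only one place a username can live.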
* That might explain why some people are downvoting your question -- some people in StackOverflow seem to have a rather strict idea about what is a "valid" question. I have seen questions closed or even deleted because they are too subjective and opinion-based. But I have also seen questions deleted because the answer is too "obvious". One of my answers with 100 upvotes was lost because a moderator thought the question of "Do I really need version control if I work solo?" was invalid. Go figure. I copied that one to my blog here.

I think it is a bad idea. Databases are optimized for joins (assuming you did your job and indexed correctly), and denormalized data is notoriously hard to maintain. There may be no username changes now, but can you guarantee that for the future? No. Risking your data integrity on such a thing is short-sighted at best.
Only denormalize in rare cases where there is an existing performance problem and other optimization techniques have failed to improve the situation. Denormalizing isn't even always going to get you any performance improvement. As the tables get wider, it may even slow down performance. So don't do it unless you have a measurable performance problem and you measure and ensure the denormalization actually helps. It is the last optimization technique to try, so if you haven't first gone through the very large list of other optimization techniques, denormalization should not be an option.

No. This goes against all principles of data normalization.
And it won't even be that difficult (if I'm interpreting what you mean by id, user_id, and user_name): id will be the primary key tying everything together and the linchpin of your JOINs. So you'll have your main table with id, other activity col, next activity col, etc. (not sure what you mean by activity), then a second table with just id and user_id, and a third with id and username. When you want to output whatever you're going to output, and do it by user_id or username, you'll just JOIN (implied join syntax: WHERE table1.id = table2.id).

Related

how to improve speed in database? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I am starting to create the first web application of my career, using MySQL.
I am going to make a table which contains user information (like id, firstname, lastname, email, password, phone number).
Which of the following is better?
Put all data into one single table (userinfo).
Divide all data by first letter and put it into many tables. For example, if the user's email is Joe@gmail.com, the row goes into table userinfo_j, and if the user's email is kevin@gmail.com, it goes into table userinfo_k.
I don't want to sound condescending, but I think you should spend some time reading up on database design before tackling this project, especially the concept of normalization, which provides consistent and proven rules for how to store information in a relational database.
In general, my recommendation is to build your database to be easy to maintain and understand first and foremost. On modern hardware, a reasonably well-designed database with indexes running relational queries can support millions of records, often tens or hundreds of millions of records without performance problems.
If your database has a performance problem, tune the query first; add indexes second, buy better hardware third, and if that doesn't work, you may consider a design that makes the application harder to maintain (often called denormalization).
Your second solution will almost certainly be slower for most cases.
Relational databases are really, really fast when searching by indexed fields; searching for "email like 'Joe@gmail.com'" on a reasonable database will be too fast to measure on a database with tens of millions of records.
However, including the logic to find the right table in which to search will almost certainly be slower than searching in all the tables.
Especially if you want to search by things other than email address - imagine finding all the users who signed up in the last week. Or who have permission to do a certain thing in your application. Or who have a @gmail.com account.
So, the second solution is bad from a design/maintenance point of view, and will almost certainly be slower.
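As a rough illustration of the single-table design, the sketch below keeps every user in one userinfo table with an index on email. It uses Python's sqlite3 module as a stand-in for MySQL; the signed_up column and the sample rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userinfo (
        id        INTEGER PRIMARY KEY,
        firstname TEXT NOT NULL,
        email     TEXT NOT NULL,
        signed_up TEXT NOT NULL   -- ISO date; hypothetical column
    );
    CREATE UNIQUE INDEX idx_userinfo_email ON userinfo (email);
    INSERT INTO userinfo (firstname, email, signed_up) VALUES
        ('Joe',   'joe@gmail.com',   '2024-01-02'),
        ('Kevin', 'kevin@gmail.com', '2024-01-09'),
        ('Ann',   'ann@example.org', '2024-01-10');
""")

# Exact lookup by email: served by the index, no per-letter table needed.
by_email = conn.execute(
    "SELECT firstname FROM userinfo WHERE email = ?", ("joe@gmail.com",)
).fetchone()

# Queries that have nothing to do with the first letter of the email
# still need only one table.
recent = conn.execute(
    "SELECT firstname FROM userinfo WHERE signed_up >= ? ORDER BY signed_up",
    ("2024-01-08",),
).fetchall()
gmail = conn.execute(
    "SELECT firstname FROM userinfo WHERE email LIKE '%@gmail.com' ORDER BY id"
).fetchall()
print(by_email, recent, gmail)
```

With per-letter tables, the last two queries would each have to visit every userinfo_x table and combine the results.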
The first one is better. In the second you will have to write extra logic to work out which table to look in. To speed up searches you can add indexes. I suppose you will do equality lookups more often than less-than or greater-than comparisons, so you can try a hash index. For range comparisons, B-tree indexes are better.
Like others said, the first one is better, especially if you need to add other tables to your database and link them to the users table. With the second one it will soon become impossible to create and maintain relationships as your number of tables increases.

"Merging" Multiple Database Tables [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've read through multiple questions here on SO regarding merging multiple databases into one, however they primarily all deal with uniform schema/tables. My apologies if I'm repeating a question.
I have an assortment of database tables that are all similar, but not identical. For example, imagine ten databases with ten "User" tables. All contain a userid (we'll use this for reference). Most contain username and email columns. Some will contain other columns, such as skype, msn, phone, etc. that exist in only a few of the other tables, or in no other tables.
I want to merge this content into one database, with the prerequisite that, moving forward, the possibility of additional databases also containing unique columns will also need to be merged into the new database.
I've been looking at EAV tables, and was considering something along the lines of (continuing with the example above) a master user table that had a newly-assigned user id (id), an originating database reference of some type (database_id), and the originating user id (native_user_id). I'd then have a separate properties table with a primary key (id), an entity key (user_id), an attribute (attribute) column, and the value (value) column.
The issue at hand is that almost everything I've read recommends against EAV tables while implying there are better ways to go about this. However, I've not actually found any material that covers what this method would be.
So, my questions:
Are EAV Tables really that bad?
What major practical pitfalls should I plan ahead for if I go the EAV table route (any examples of personal experience would be swell)?
What alternatives exist for handling this type of scenario besides EAV tables (while accommodating future attributes without tedious ALTER TABLE commands)?
I used EAV in a project to address requirements similar to yours: lack of a universal data model in the messy real world.
In my case, EAV allowed incremental change as the company grew by acquisition, which in turn caused continual expansion, refinement, or generalization of the data model. The project ultimately failed because management withdrew support for it.
I learned that EAV presents itself to management and users as needlessly complex unless you do the work to create concise views to hide the complexity while preserving the completeness of the data. I also learned that EAV imposes a demand to fill in the "missing answers" in a meaningful way. It isn't enough to say that every answer to a question that wasn't asked in database X is "NULL". Sometimes that is not the right answer. "NULL" becomes a synonym for "I don't know; the attribute didn't exist in this database so no-one ever decided what the value should be".
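A minimal sketch of the layout described in the question, plus one "concise view" of the kind mentioned above, might look like this. It uses Python's sqlite3 module as a stand-in for MySQL; the table names master_user and user_property, the view name user_flat, and the sample data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_user (
        id             INTEGER PRIMARY KEY,
        database_id    TEXT NOT NULL,
        native_user_id INTEGER NOT NULL,
        UNIQUE (database_id, native_user_id)
    );
    CREATE TABLE user_property (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL REFERENCES master_user(id),
        attribute TEXT NOT NULL,
        value     TEXT,
        UNIQUE (user_id, attribute)
    );
    -- A view that pivots selected attributes back into ordinary columns.
    CREATE VIEW user_flat AS
    SELECT u.id,
           MAX(CASE WHEN p.attribute = 'email' THEN p.value END) AS email,
           MAX(CASE WHEN p.attribute = 'skype' THEN p.value END) AS skype
    FROM master_user AS u
    LEFT JOIN user_property AS p ON p.user_id = u.id
    GROUP BY u.id;

    INSERT INTO master_user (id, database_id, native_user_id) VALUES
        (1, 'db_one', 17), (2, 'db_two', 17);
    INSERT INTO user_property (user_id, attribute, value) VALUES
        (1, 'email', 'a@example.org'),
        (1, 'skype', 'a.skype'),
        (2, 'email', 'b@example.org');
""")

flat = conn.execute(
    "SELECT id, email, skype FROM user_flat ORDER BY id"
).fetchall()
print(flat)
```

Note that user 2 comes back with no skype value. The view alone cannot tell you whether that means "has no Skype account" or "db_two never asked", which is the "missing answers" problem described above.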
This is a fairly broad question, eh?
If you have your tables already in SQL I suggest you try experimenting with this sort of UNION ALL query.
SELECT 'one' AS dbid,
       id AS id,
       first AS first_name,
       last AS last_name
  FROM first_table
UNION ALL
SELECT 'two' AS dbid,
       member_id AS id,
       fname AS first_name,
       lname AS last_name
  FROM members
Etcetera. The idea is to use a UNION ALL query to coerce your various sources of information into a single result set, and figure out which of your values from those various sources are somehow conformable. If the lion's share of your data is conformable -- that is, if you can simply move it over into appropriate columns in your new tables -- you'll avoid the worst pitfalls of EAV storage.
Once you have done that, you can use EAV style storage for your remaining information.
I hope this helps you plan this migration a bit.

mysql table with 40+ columns

I have 40+ columns in my table and I have to add a few more fields, like current city, hometown, school, work, uni, college.
This user data will be pulled for many matching users who are mutual friends (joining the friend table with the other user's friends to find mutual friends), who are not blocked, and who are not already friends with the user.
The above request is a little complex, so I thought it would be a good idea to put the extra data in the same user table for fast access, rather than adding more joins, which would slow the query down further. But I wanted to get your suggestions on this.
My friend told me to store the extra fields, which won't be searched on, in one field as serialized data.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
There is nothing wrong with this table and its columns.
Follow the approach in "MySQL: Optimize table with lots of columns", which serializes the extra, non-searchable fields into one field.
Create another table and put most of the data there. (This gets harder on joins, since I already have 3 or more tables to join to pull the records for users, e.g. friends, user, check mutual friends.)
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will-it-work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have an x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
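To show how the example query above changes shape, here is a sketch that runs it against a conventional table and against a key/value table. It uses Python's sqlite3 module as a stand-in for MySQL; the table names user_wide and user_attr, the registered_on column, the sample rows, and the fixed cutoff date standing in for "yesterday" are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_wide (
        id            INTEGER PRIMARY KEY,
        status        INTEGER NOT NULL,
        facebook_id   TEXT,
        registered_on TEXT NOT NULL
    );
    CREATE TABLE user_attr (
        user_id   INTEGER NOT NULL,
        attribute TEXT NOT NULL,
        value     TEXT,            -- everything is text: no type checking
        PRIMARY KEY (user_id, attribute)
    );
    INSERT INTO user_wide VALUES
        (1, 1, NULL,   '2024-03-10'),
        (2, 1, 'fb-2', '2024-03-10'),
        (3, 0, NULL,   '2024-03-10'),
        (4, 1, NULL,   '2024-01-01');
    INSERT INTO user_attr VALUES
        (1, 'status', '1'), (1, 'registered_on', '2024-03-10'),
        (2, 'status', '1'), (2, 'facebook_id', 'fb-2'),
        (2, 'registered_on', '2024-03-10'),
        (3, 'status', '0'), (3, 'registered_on', '2024-03-10'),
        (4, 'status', '1'), (4, 'registered_on', '2024-01-01');
""")

cutoff = "2024-03-09"  # stands in for "yesterday"

# Conventional table: one WHERE clause.
wide = conn.execute("""
    SELECT id FROM user_wide
    WHERE status = 1 AND facebook_id IS NULL AND registered_on > ?
    ORDER BY id
""", (cutoff,)).fetchall()

# Key/value table: one self-join per attribute in the condition.
eav = conn.execute("""
    SELECT s.user_id
    FROM user_attr AS s
    JOIN user_attr AS r
      ON r.user_id = s.user_id AND r.attribute = 'registered_on'
    LEFT JOIN user_attr AS f
      ON f.user_id = s.user_id AND f.attribute = 'facebook_id'
    WHERE s.attribute = 'status' AND s.value = '1'
      AND r.value > ?
      AND f.value IS NULL
    ORDER BY s.user_id
""", (cutoff,)).fetchall()
print(wide, eav)
```

Both queries return the same user, but the key/value version grows by one join for every additional attribute in the filter.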
I would advise you to run some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data, etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joins in a well-designed schema will perform well. You can use EXPLAIN to optimize queries. You can log slow queries and improve their performance. Databases have been around for years. If putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row level locking and not table. Make sure you are using the correct Engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning can improve the performance of a large dataset by keeping slices of the data in separate partitions, allowing you to run queries on parts of large datasets instead of all of the information.
Use correct column types
Consider using UUID Primary Keys for portability and future growth. If you use proper column types, it will improve performance of your data.
Do not serialize data
Using serialized data is the worst way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code will be responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also known as entity-key-value stores, as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_post_meta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an infinite collection of attributes for a post.
Neither the "long row model" nor the "short row model" is always the best model; often the best choice is a combination of the two. As @nevillek pointed out, searching and validating "short row" data is not easy, and fetching data can involve pivoting, which is annoyingly difficult in MySQL and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument; most had fewer than 20 facts. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose to use a combination of a table with about 20 columns containing the most popular facts and a 4-column table which contained the other facts. This design was efficient but was difficult to access, so we built a few table functions in PL/SQL to assist with this.
I have a general comment for you,
Think about it: If you put anything more than 10-12 columns in a table even if it makes sense to put them in a table, I guess you are going to pay the price in the short term, long term and medium term.
Your 3 tables approach seems to be better than the 1 table approach, but consider making those into 5-6 tables rather than 3 tables because you still can.
Move currently, currently_position, currently_link from user-table and work from user-profile into a new table with your primary key called USERWORKPROFILE.
Move locale Information from user-profile to a newer USERPROFILELOCALE information because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, better the performance, because you are actually handing out the performance to the database provider instead of taking it all in your own hands.

MySQL: multiple tables or one table with many columns? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 11 months ago.
The community reviewed whether to reopen this question 11 months ago and left it closed:
Original close reason(s) were not resolved
So this is more of a design question.
I have one primary key (say the user's ID), and I have tons of information associated with that user.
Should I have multiple tables broken down into categories according to the information, or should I have just one table with many columns?
The way I used to do it was to have multiple tables, so say, one table for application usage data, one table for profile info, one table for back end tokens etc. to keep things looking organized.
Recently someone told me that it's better not to do it that way and that having a table with lots of columns is fine. The thing is, all those columns have the same primary key.
I'm pretty new to database design so which approach is better and what are the pros and cons?
What's the conventional way of doing it?
Any time information is one-to-one (each user has one name and password), it's probably better to have it in one table, since that reduces the number of joins the database will need to do to retrieve results. I think some databases have a limit on the number of columns per table, but I wouldn't worry about it in normal cases, and you can always split it later if you need to.
If the data is one-to-many (each user has thousands of rows of usage info), then it should be split into separate tables to reduce duplicate data (duplicate data wastes storage space, cache space, and makes the database harder to maintain).
You might find the Wikipedia article on database normalization interesting, since it discusses the reasons for this in depth:
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Denormalization is also something to be aware of, because there are cases where repeating data is better (since it reduces the amount of work the database needs to do when reading data). I'd highly recommend making your data as normalized as possible to start out, and only denormalize if you're aware of performance problems in specific queries.
One big table is often a poor choice. Related tables are what relational databases were designed to work with. If you index properly and know how to write performant queries, they are going to perform fine.
When a table gets too many columns, you can run into issues with the size of the page on which the database stores the data. Either the record ends up too large for the page, in which case you may be unable to create or update a specific record (which makes users unhappy), or you may (in SQL Server at least) be allowed some overflow for particular datatypes, with a set of rules you need to look up if you are doing this. If many records overflow the page size, you can create tremendous performance problems. How MySQL handles pages, and whether you have a problem when the potential row size gets too large, is something you would have to look up in the documentation for that database.
Came across this, and as someone who used to use MySQL a lot, and then switched over to Postgres recently, one of the big advantages is that you can add JSON objects to a field in Postgres.
So if you are in this situation, you don't have to necessarily decide between one large table with many columns and splitting it up, but you can merge columns into JSON objects to reduce it e.g. instead of address being 5 columns, it can just be one. You can also query on that object too.
I have a good example: an overly normalized database with the following set of relationships:
people -> rel_p2staff -> staff
and
people -> rel_p2prosp -> prospects
Here people holds names and personal details, staff holds just the staff record details, prospects holds just the prospect details, and the rel tables are relationship tables with foreign keys linking people to staff and prospects.
This sort of design carries on through the entire database.
Querying this set of relations takes a multi-table join every time, sometimes a join of 8 or more tables. It had been working fine up to the middle of this year, when it started getting very slow now that we have passed 40000 records of people.
Indexing and all the low-hanging fruit had been used up last year, and all queries are optimized to perfection. This is the end of the road for this particular normalized design, and management has now approved a rebuild of the entire application that depends on it, as well as a restructure of the database, over a term of 6 months. $$$$ Ouch.
The solution will be to have a direct relation for people -> staff and people -> prospects.
Ask yourself these questions. If you put everything in one table, will you have multiple rows for that user? If you have to update a user, do you want to keep an audit trail? Can the user have more than one instance of a data element (like a phone number, for instance)? Will you have a case where you might want to add an element or set of elements later?
If you answer yes, then most likely you want child tables with foreign key relationships.
Pros of parent/child tables: data integrity, performance via indexes (yes, you can do that on a flat table too) and, IMO, easier maintenance if you need to add a field later, especially if it will be a required field.
Cons: design is harder and queries become slightly more complex.
But, there are many cases where one big flat table will be appropriate so you have to look at your situation to decide.
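For the phone-number case, a child table could be sketched roughly as follows. This uses Python's sqlite3 module as a stand-in for MySQL; the user_phones table name and the sample numbers are hypothetical, and foreign key enforcement is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE user_phones (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        phone   TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'alice');
    -- One user, several phone numbers: one row each in the child table.
    INSERT INTO user_phones (user_id, phone) VALUES
        (1, '555-0100'), (1, '555-0101');
""")

phones = conn.execute(
    "SELECT phone FROM user_phones WHERE user_id = ? ORDER BY id", (1,)
).fetchall()

# The foreign key rejects a phone row for a user that does not exist.
try:
    conn.execute(
        "INSERT INTO user_phones (user_id, phone) VALUES (99, '555-0199')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(phones, orphan_rejected)
```

A flat table would need phone1, phone2, ... columns instead, and adding a third number would mean changing the schema.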
I've done some database design. For me, it depends on how complex the system and its database management are. It is true that each piece of data should live in one place only, but it is really hard to write queries against an overly normalized database with lots of records. Just combine the two approaches: use one huge table if you expect a massive number of records that are hard to maintain, as at Facebook, Gmail, etc., and use separate tables per set of records for a simple system. Well, this is just my opinion. I hope it helps. Just do it, you can do it... :)
The conventional way of doing this would be to use different tables, as in a star schema or snowflake schema. However, I would make the strategy twofold. I believe in the theory that data should only exist in one place, so the schemas I mentioned would work well. However, I also believe that for reporting engines and BI suites a columnar approach would be hugely beneficial, because it is more supportive of the reporting needs. Columnar approaches like those from infobright.org offer huge performance gains and compression, which makes using both approaches incredibly useful. A lot of companies are starting to realize that having just one database architecture in the organization does not support the full range of their needs, and are implementing more than one database architecture.
I think having a single table is more effective, but you should make sure that the table is organised in a manner that shows the relationships and trends as well as the differences between variables in the same row.
For example, if the table shows the ages and grades of students, you should arrange it so that the highest scorer is clearly differentiated from the lowest scorer and the difference in the students' ages is evident.

Storing duplicate data in MySQL tables

I have a table with all registered members, with columns like uid, username, last_action_time.
I also have a table that keeps track of who has been online in the past 5 minutes. It is populated by a cronjob by pulling data from members with last_action_time being less than 5 minutes ago.
Question: Should my online table include username or no? I'm asking this because I could JOIN both tables to obtain this data, but I could store the username in the online table and not have to join. My concern is that I will have duplicate data stored in two tables, and that seems wrong.
If you haven't run into performance issues, DO NOT denormalize. There is a good saying: "normalize until it hurts, denormalize until it works". In your case, it works with the normalized schema (users table joined). And databases are designed to handle huge amounts of data.
This approach is called denormalization. Sometimes, to get a quick SELECT query, we have to duplicate some data across tables. In this case I believe it is a good choice if you have a lot of data in both tables.
You just hit a very valid question: when does it make sense to duplicate data?
I could rewrite your question as: when does it make sense to use a cache. Caches need maintenance, you need to keep them up to date yourself and they use up some extra space (although negligible in this case). But they have a pro: performance increase.
In the example you mentioned, you need to see if that performance increase is actually worth it and if it outweighs the additional work of having and maintaining a cache.
My gut feeling is that your database isn't gigantic, so joining every time should take a minimal amount of effort from the server, so I'd go with that.
Hope it helps
You shouldn't store the username in the online table. There shouldn't be any performance issue. Just use a join every time to get the username.
Plus, you don't need the online table at all. Why not query the members table directly for users whose last_action_time is less than 5 minutes ago?
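Querying members directly could look something like the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, assumes last_action_time is stored as a Unix timestamp, and uses a fixed "now" and invented rows so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (
        uid              INTEGER PRIMARY KEY,
        username         TEXT NOT NULL,
        last_action_time INTEGER NOT NULL  -- Unix timestamp (assumed)
    );
    -- An index on last_action_time keeps the "who is online" query cheap.
    CREATE INDEX idx_members_last_action ON members (last_action_time);
""")

now = 1_700_000_000  # fixed "current time" for the example
conn.executemany(
    "INSERT INTO members (uid, username, last_action_time) VALUES (?, ?, ?)",
    [(1, "alice", now - 60), (2, "bob", now - 299), (3, "carol", now - 3600)],
)

# Everyone active in the last 5 minutes, straight from members:
# no cron job, no second table, no duplicated username.
online = conn.execute(
    "SELECT username FROM members WHERE last_action_time >= ? ORDER BY uid",
    (now - 5 * 60,),
).fetchall()
print(online)
```

In a real application the cutoff would be computed from the current time on each request.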
A user ID would be an integer (i.e. 4 bytes). A username, I would imagine, is up to 16 bytes. How many users are there? How often does a username change? These are the questions to consider.
I would just store the username. I would have thought that once the username is registered, it is fixed for the duration.
It is difficult to answer these questions without a little background. Performance issues are difficult to reason about when the depth and breadth of the data, the usage, etc. are not known.