Adding extra fields to avoid needing joins - MySQL

In terms of schema design, is it appropriate to add extra table fields I wouldn't otherwise need in order to avoid having to do a join? Example:
products_table
| id | name | seller_id |
users_table
| id | username |
reviews_table
| id | product_id | seller_id |
For the reviews table, I could leave seller_id out and use a join on the products table to get the seller's user id whenever I need it. There are often tables where several joins are needed to get at some information, where I could just have my app write redundant data into the table instead. Which is more correct in terms of schema design?

You seem overly concerned about the performance of JOINs. With proper indexing, performance is not usually an issue. In fact, there are situations where JOINs are faster -- because the data is more compact in two tables than storing the fields over and over and over again (this applies more to strings than to integers, though).
If you are going to have multiple tables, then use JOINs to access the "lookup" information. There may be some situations where you want to denormalize the information. But in general, you don't. And premature optimization is the root of a lot of bad design.
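"Proper indexing" here means something like the sketch below, using the question's table names (the index name is invented, and products_table.id is assumed to already be the primary key):
ALTER TABLE reviews_table ADD INDEX idx_reviews_product (product_id);
-- The "lookup" join the question is trying to avoid:
SELECT r.id, r.product_id, p.seller_id
FROM reviews_table AS r
JOIN products_table AS p ON p.id = r.product_id;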

Suppose you add a column reviews.seller_id and you populate it with values, and then some weeks later you find that the values aren't always the same as the seller in the products_table.
In other words, the following query should always return a count of 0, but what if one day it returns a count of 6?
SELECT COUNT(*)
FROM products_table AS p
JOIN reviews_table AS r ON r.product_id = p.id
WHERE p.seller_id <> r.seller_id
Meaning there was some update of one table, but not the other. They weren't both updated to keep the seller_id in sync.
How did this happen? Which table was updated, and which one still has the original seller_id? Which one is correct? Was the update intentional?
You start researching each of the 6 cases, verify who is the correct seller, and update the data to make them match.
Then the next week, the count of mismatched sellers is 1477. You must have a bug in your code somewhere that allows an update to one table without updating the other to match. Now you have a much larger data cleanup project, and a bug-hunt to go find out how this could happen.
And how many other times have you done the same thing for other columns -- copied them into a related table to avoid a join? Are those creating mismatched data too? How would you check them all? Do you need to check them every night? Can they be corrected?
This is the kind of trouble you get into when you use denormalization, in other words storing columns redundantly to avoid joins, aggregations, or expensive calculations in order to speed up certain queries.
In fact, you don't avoid those operations, you just move the work of those operations to an earlier time.
It's possible to make it all work seamlessly, but it's a lot more work for the coder to develop and test the perfect code, and fix the subsequent code bugs and inevitable data cleanup chores.
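If you keep the redundant column anyway, you end up writing cleanup jobs like the following sketch, which assumes products_table is the source of truth (the same assumption the detection query above makes):
-- Hypothetical nightly repair: copy the authoritative seller back into the reviews.
UPDATE reviews_table AS r
JOIN products_table AS p ON p.id = r.product_id
SET r.seller_id = p.seller_id
WHERE r.seller_id <> p.seller_id;
That job is exactly the work "moved to an earlier time", and it still can't tell you which of the mismatched updates was intentional.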

This depends on each specific case. Purely in terms of schema design, you should not have any redundant columns (see database normalization). However, in real-world scenarios it sometimes makes sense to have redundant data; for example, when you have performance issues, you can sacrifice some storage space in order to make SELECT queries faster.

Adding a redundant column today will make you curse tomorrow. If you handle keys in the database properly, performance will not penalize you.

Related

Divide SQL data into different tables

I want to split users' data into different tables so that there isn't a huge one containing all the data...
The problem is that in tables other than the main one I can't tell which user each piece of data belongs to.
Should I store the same user id in every table during the signup? Doesn't it create unnecessary duplicates?
EDIT:
example
table:
| id | user | email | phone number | password | followers | following | likes | posts |
becomes
table 1:
| id | user | email | phone number | password |
table 2:
| id | followers num | following num | likes num | posts num |
This looks like an "XY problem".
You want to "not have a huge table". But why is it that you have this requirement?
Probably it's because some responses in some scenarios are slower than you expect.
Rather than split tables every which way, which as Gordon Linoff mentioned is a SQL antipattern and liable to leave you more in the lurch than before, you should monitor your system and measure the performance of the various queries you use, weighting them by frequency. That is, if query #1 runs one hundred thousand times per period and takes 0.2 seconds, chalk up 20,000 seconds to query #1. Query #2, which takes fifty times longer (ten full seconds) but runs only one hundred times, accrues only one twentieth of the total time of the first.
(Since long delays are noticeable to end users, some use a variation of this formula in which you multiply the number of executions of a query by the square, or a higher power, of its duration in milliseconds. This way, slower queries get more attention.)
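One hedged way to get those frequency-weighted numbers, assuming performance_schema is enabled (MySQL 5.6+ with the statement digest summary), is:
-- Total time = executions x average time, i.e. the weighting described above.
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       AVG_TIMER_WAIT / 1e12 AS avg_seconds,
       SUM_TIMER_WAIT / 1e12 AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;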
Be that as it may, once you know which queries you should optimize first, then you can start optimizing your schema.
The first thing to check are indexes. And maybe normalization. Those cover a good two thirds of the "low performing" cases I have met so far.
Then there's segmentation. Not in your case maybe, but you might have a table of transactions or such where you're usually only interested in the current calendar or fiscal year. Adding a column with that information will make the table larger, but selecting only those records that at minimum match a condition on the year will make most queries run much faster. This is supported at a lower level also (see "Sharding").
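To illustrate the year-column idea, here is a sketch with made-up table and column names; note that MySQL requires the partitioning column to be part of every unique key, including the primary key:
CREATE TABLE transactions (
    id          BIGINT NOT NULL AUTO_INCREMENT,
    fiscal_year SMALLINT NOT NULL,
    amount      DECIMAL(12,2) NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (id, fiscal_year)
)
PARTITION BY RANGE (fiscal_year) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- A query with WHERE fiscal_year = 2023 only has to read partition p2023.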
Then there are careless JOINs and sub-SELECTs. Usually they start small and fast, so no one bothers to check indexes, normalization or conditions on those. After a couple of years, the inner SELECT is gathering one million records, and the outer JOIN discards nine hundred and ninety-nine thousand of them. Move the discarding condition inside the subselect and see the query take off.
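A sketch of that last point, with invented table names; the fix is simply to filter inside the derived table instead of after it:
-- Before: the derived table aggregates every member, and the outer WHERE throws most of it away.
SELECT t.member_id, t.total
FROM (SELECT member_id, SUM(amount) AS total
      FROM sales
      GROUP BY member_id) AS t
JOIN members AS m ON m.id = t.member_id
WHERE m.region = 'EU';

-- After: the discarding condition moved inside the subselect.
SELECT t.member_id, t.total
FROM (SELECT s.member_id, SUM(s.amount) AS total
      FROM sales AS s
      JOIN members AS m ON m.id = s.member_id
      WHERE m.region = 'EU'
      GROUP BY s.member_id) AS t;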
Then you can check whether some information is really rarely accessed (for example, I have one DB where each user has a bunch of financial information, but this is only needed in maybe 0.1% of requests. So in that case yes, I have split that information in a secondary table, also gaining the possibility of supporting users with multiple bank accounts enrolled in the system. That was not why I did it, mind you).
In all this, also take into account time and money. Doing the analysis, running the modifications and checking them out, plus any downtime, is going to cost something and possibly even increase maintenance costs. Maybe - just maybe - throwing less money than that into a faster disk or more RAM or more or faster CPUs might achieve the same improvements without any need to alter either the schema or your code base.
I think you want to use a LEFT JOIN
SELECT t1.`user`, t2.`posts`
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t1.id = t2.id
EDIT: Here is a link to documentation that explains different types of JOINS
I believe I understand your question: if you are wondering, you can use a foreign key. When you have a list of users, make sure that each user has a specific id.
Later, when you insert data about a user into a different table, you can include that user's id, taken from a session variable or a GET request.
Then, when you need to pull data for that specific user from those other tables, you can just SELECT FROM the table WHERE the id equals the id from the session or GET parameter.
Does that help?
Answer: use a foreign key to identify users' data using GETs and sessions.
Don't worry about duplicates if you are removing those values from the main table.
One table would probably have an AUTO_INCREMENT for the PRIMARY KEY; the other table would have the identical PK, but it would not be AUTO_INCREMENT. JOINing the tables will put the tables "back together" for querying.
There is rarely a good reason to "vertically partition" a table. One rare case is to split out the "like_count" or "view_count". This way the main table would not be bothered by the incessant UPDATEing of the counters. In some extreme cases, this may help performance.
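A minimal sketch of that shared-primary-key split, with names loosely based on the question's example (all invented for illustration):
CREATE TABLE users (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `user`   VARCHAR(64)  NOT NULL,
    email    VARCHAR(255) NOT NULL,
    password CHAR(60)     NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE user_counters (
    id            INT UNSIGNED NOT NULL,  -- same value as users.id, deliberately not AUTO_INCREMENT
    followers_num INT UNSIGNED NOT NULL DEFAULT 0,
    following_num INT UNSIGNED NOT NULL DEFAULT 0,
    likes_num     INT UNSIGNED NOT NULL DEFAULT 0,
    posts_num     INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id),
    FOREIGN KEY (id) REFERENCES users (id)
);

-- "Back together" for querying:
SELECT u.`user`, c.posts_num
FROM users AS u
JOIN user_counters AS c ON c.id = u.id;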

Join 10 tables on a single join id called session_id that's stored in session table. Is this good/bad practice?

There are 10 tables, all with a session_id column, and a single session table. The goal is to join them all to the session table. I get the feeling this is a major code smell. Is this good or bad practice?
What problems could occur?
Whether this is a good design or not depends deeply on what you are trying to represent with it. So, it might be OK or it might not be... there's no way to tell just from your question in its current form.
That being said, there are a couple of ways to speed up a join:
Use indexes.
Use covering indexes.
Under the right DBMS, you could use a materialized view to store pre-joined rows. You should be able to simulate that under MySQL by maintaining a special table via triggers (or even manually); a sketch follows this list.
Don't join a table unless you actually need its fields. List only the fields you need in the SELECT list (instead of blindly using *). The fastest operation is the one you don't have to do!
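Here is the trigger-maintained table mentioned above as a rough sketch; session_summary and page_views are invented names, and a real version would also need triggers for UPDATE and DELETE:
CREATE TABLE session_summary (
    session_id INT NOT NULL,
    page_views INT NOT NULL DEFAULT 0,
    PRIMARY KEY (session_id)
);

-- Keep the pre-aggregated row current whenever the detail table grows.
CREATE TRIGGER trg_page_views_ai
AFTER INSERT ON page_views
FOR EACH ROW
  INSERT INTO session_summary (session_id, page_views)
  VALUES (NEW.session_id, 1)
  ON DUPLICATE KEY UPDATE page_views = page_views + 1;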
And above all, measure on representative amounts of data! Possible results:
It's lightning fast. Yay!
It's slow, but it doesn't matter that it's slow (i.e. rarely used / not important).
It's slow and it matters that it's slow. Strap in, you have work to do!
We need the query with its 11 joins and the EXPLAIN posted in the original question when it is available, please. And be kind to your community: for every table involved, please also post SHOW CREATE TABLE tblname and SHOW INDEX FROM tblname to avoid additional requests for these 11 tables. Then we will know the scope of data and the cardinality involved for each indexed column.
Of course more joins can hurt performance, but it depends! If your data model is like that, then you can't help yourself here unless a complete data model redesign happens.
1) Is it an online (real-time transactional) DB or an offline DB (data warehouse)?
If online, it is better to maintain a single table: keep the data in one table and let the number of columns grow.
If offline, it is better to maintain separate tables, because you are not going to need all the columns all the time.

Does this de-normalization make sense?

I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 mins. But since I call this query over a web page, the delay is noticeable.
As a concrete example, let's say that the tables I am joining are a Supplier table and a table that records which warehouses a supplier serviced on specific dates. Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
The query itself cannot be improved much, since it is a simple join between 2 indexed tables, but the date range complicates things.
I had the following idea, though I am not sure if it makes sense.
Since the data I am querying (especially for previous dates) does not change, what if I created another table that has as its primary key the columns in my WHERE clause, and as a value the list of IDs (comma separated)?
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even in 1st normal form, but does it make sense? Is there another approach?
It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of id's?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
Re your comments:
Yes, generally storing a list of id's in a string is against relational database design. See my answer to Is storing a delimited list in a database column really that bad?
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.
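If you do denormalize, one row per id in an indexed table is usually a better shape than a comma-separated string. A hedged sketch, where warehouse_services and orders stand in for your real tables:
CREATE TABLE supplier_service_cache (
    warehouse_id INT NOT NULL,
    service_date DATE NOT NULL,
    supplier_id  INT NOT NULL,
    PRIMARY KEY (warehouse_id, service_date, supplier_id)
);

-- Refreshed periodically (cron or an EVENT), since old dates don't change:
REPLACE INTO supplier_service_cache (warehouse_id, service_date, supplier_id)
SELECT ws.warehouse_id, ws.service_date, ws.supplier_id
FROM warehouse_services AS ws;

-- The complex query then filters against indexed rows instead of parsing a string:
SELECT o.id
FROM orders AS o
WHERE o.supplier_id IN (SELECT c.supplier_id
                        FROM supplier_service_cache AS c
                        WHERE c.warehouse_id = 42
                          AND c.service_date BETWEEN '2014-01-01' AND '2014-03-31');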
In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
While it's far from clear why you actually need 2 tables here, nor whether denormalizing the data would make the resulting query faster, one thing of note is that your data is unlikely to change after capture, hence maintaining the current structure along with a materialized view would have minimal overhead. First test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table: can you substitute one of the existing tables with a view on the new table, or do you keep both your original tables and populate the new table by batch, or by triggers?
It's not hard to try it out and see what works, and you'll get a far better answer than anyone here can give you.
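Testing that is cheap. A sketch of "putting the sub-query results into a properly indexed table", with table and column names guessed from the description:
CREATE TABLE supplier_ids_tmp AS
SELECT DISTINCT ws.supplier_id
FROM warehouse_services AS ws
WHERE ws.warehouse_id = 42
  AND ws.service_date BETWEEN '2014-01-01' AND '2014-03-31';

ALTER TABLE supplier_ids_tmp ADD PRIMARY KEY (supplier_id);

-- Point the complex query's IN (...) at supplier_ids_tmp and compare timings.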

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc., where the number of attributes may increase in the future. The purpose is to display this table to the user, and he should be able to quickly filter/sort the data (so the table structure should support fast querying & sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns and have to keep in mind that there might be a few null values and may also need to add more columns in the future)
website_attributes_change_log
same table structure as above with an added column for "change_update_time"
I feel the advantage of this schema is the queries will be easy to write even when some attributes are linked to other tables and also sorting will be simple. The disadvantage I guess will be adding columns later can be problematic with ALTER TABLE taking very long to run on large data tables + there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage out here seems to be the flexibility of this approach, in that I can add columns whenever and also I save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when needing to display the tables [since I will need to display records for multiple sites at a time and there will also be cross referencing of values with other tables for certain attributes] + sorting the data might be difficult [given that this is not a column based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
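To make the EAV awkwardness concrete, here is roughly what reading the data back under Option B looks like, using the table names from the question (one MAX(CASE ...) per attribute you want as a column; a sketch, not a recommendation):
SELECT wa.website_id,
       MAX(CASE WHEN f.attribute_name = 'page_size'  THEN wa.attribute_value END) AS page_size,
       MAX(CASE WHEN f.attribute_name = 'word_count' THEN wa.attribute_value END) AS word_count
FROM website_attributes AS wa
JOIN website_attribute_fields AS f ON f.attribute_id = wa.attribute_id
GROUP BY wa.website_id;
-- Sorting by page_size then needs CAST(... AS UNSIGNED) because attribute_value is stored
-- as a string, whereas Option A sorts on a real numeric column.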
Option A is the better way. Though an ALTER TABLE to add an extra column may take a long time, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE did not take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. With Option A you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want to improve query times, you should definitely index your database to get fast results.
I think Option A is not a good design. When you design a good data model, you should not have to change the tables in the future. If you know SQL well, writing queries for Option B will not be difficult. It is also the solution to your real problem: you need to store some attributes (an open-ended number, not a final set) of some web pages; therefore, an entity should exist to represent those attributes.
Use Option A, since the attributes are fixed. It will be difficult to query and process data from the second model, as queries will be based on multiple attributes.

MySQL performance; large data table or multiple data tables?

I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, what would be better for performance of the following:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, and then a table of all phone numbers, etc and then use a select statement that has multiple joins
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, address, city, state, zip (meaning each member will have at least 1 of each of those but can have multiple, normally 1-3 per member, so in excess of 75000 phone numbers, email addresses etc)
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database, i.e. taking a CSV file of 10k rows of sales and contact data and querying the database to try to find which member each sales row from the CSV belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice)
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
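A hedged sketch of those two shapes (all names invented): fixed roles stay as columns on the member row, while a genuinely repeating item gets its own table with a foreign key back to the member:
CREATE TABLE members (
    member_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    first_name   VARCHAR(64) NOT NULL,
    last_name    VARCHAR(64) NOT NULL,
    home_phone   VARCHAR(20),   -- fixed "roles" as distinct columns
    mobile_phone VARCHAR(20),
    PRIMARY KEY (member_id)
);

CREATE TABLE member_emails (
    email_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    member_id INT UNSIGNED NOT NULL,
    email     VARCHAR(255) NOT NULL,
    PRIMARY KEY (email_id),
    KEY idx_member (member_id),
    FOREIGN KEY (member_id) REFERENCES members (member_id)
);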
If you run into performance problems with this design (personally, I don't think 75000 is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying a logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns, or even single columns, and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs. scaling up when it comes to databases. In addition to that research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
The point is that while you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, which lets your data access code stay truly declarative rather than programmatic about which object you're addressing or querying. Sharding vs. partitioning is worth noting, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither, although a variant of the first option is the right approach.
Create a 'lookup' table that stores the data type values (mail, phone, etc.). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
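A rough sketch of that suggestion, with invented names; the lookup table holds the data types, and the data table references them by id:
CREATE TABLE data_types (
    data_type_id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name         VARCHAR(32) NOT NULL UNIQUE,   -- 'email', 'phone', ...
    PRIMARY KEY (data_type_id)
);

CREATE TABLE member_data (
    data_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    member_id    INT UNSIGNED NOT NULL,
    data_type_id TINYINT UNSIGNED NOT NULL,
    data         VARCHAR(255) NOT NULL,
    PRIMARY KEY (data_id),
    KEY idx_member_type (member_id, data_type_id),
    FOREIGN KEY (data_type_id) REFERENCES data_types (data_type_id)
);
Note that this is still the single "data" table shape the earlier answer argues against, just with the type normalized into a lookup, so the same caveats about indexing and constraints apply.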