I have been working on my database and the thought occurred to me that maybe it would be better to combine two of my tables to better organise the data and perhaps get performance benefits (or not?).
I have two tables that contain addresses: one holds delivery addresses and the other invoice addresses. Their structure is identical.
What would be the implications of merging these into one table simply called "addresses" and creating a new column called addressTypeId? This new column would reference a new table containing address types such as delivery, invoice, home, etc.
Is keeping them separate, as they are now, better for performance, since requests for the different types of addresses (delivery and invoice) go to two tables rather than one, which might otherwise mean delays when requesting address data?
By the way, I am using InnoDB.
If you are missing the appropriate indexes, then lookup performance will drop by a factor of two (if you are merging two equally sized tables). However, if you are missing indexes, you likely don't care about performance anyway.
Lookup using a hash index is constant-time. Lookup using a tree index is logarithmic, so the effect is small. Writes to a tree index are logarithmic as well, and writes to a hash index are amortized constant-time.
Don't suffer from premature optimization!
A good design is more important than peak performance. Address lookup is likely not your bottleneck. Bad code resulting from a bad database design far outweighs any benefits. If you keep two tables, you are going to duplicate code, and code duplication is a maintenance nightmare.
Merge the tables. You will be thankful when you need to extend your application in the near future. You may want to add more address types. You may want to add common functionality for addresses (formatting, for example). Your customers will not notice the extra millisecond from traversing one more level of a binary tree, but they will notice when you have a hard time adding an extra feature, and they will notice inconsistencies arising from code duplication.
You might even gain performance by merging the tables. While you might need to traverse an extra node in a tree, the tree might be more likely to be cached in memory and not need disk access. Disk access is expensive. You might reduce disk access by merging.
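As a rough illustration of the merged design the question describes (a sketch only; the table names come from the question, the other column names are assumptions):

-- Lookup table of address types
CREATE TABLE addressTypes (
    addressTypeId INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL               -- 'delivery', 'invoice', 'home', ...
) ENGINE=InnoDB;

-- Single merged address table
CREATE TABLE addresses (
    addressId INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    addressTypeId INT UNSIGNED NOT NULL,
    line1 VARCHAR(100) NOT NULL,
    city VARCHAR(50) NOT NULL,
    postcode VARCHAR(10) NOT NULL,
    KEY idx_addresses_type (addressTypeId), -- supports lookups by type
    FOREIGN KEY (addressTypeId) REFERENCES addressTypes (addressTypeId)
) ENGINE=InnoDB;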
As @BenP.P.Tung already said, you don't need an extra table for an enumeration. Use an enumeration type.
If you just need to distinguish between address types, I suggest an ENUM column in this merged table. If the table already exists, you can add the new column as follows:
ALTER TABLE addresses ADD COLUMN addressTypes ENUM('delivery','invoice','home') DEFAULT NULL;
Or use DEFAULT 'invoice', or whatever value you think should apply when the required information is not available.
You don't need to define all the enum values at once. Add just what you need now, and add more values in the future as follows:
ALTER TABLE addresses CHANGE addressTypes addressTypes ENUM('delivery','invoice','home','office') DEFAULT NULL;
One table will work fine. If there is a performance concern, add the address type column at the start of the primary index. This will avoid any performance issues until you have a very large number of addresses.
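A sketch of what that could look like, reusing the merged-table idea from the question (column names are assumptions, and this is only one way to arrange the key):

-- Composite primary key with the address type first, so rows of one type
-- cluster together in InnoDB's clustered index.
CREATE TABLE addresses (
    addressTypeId INT UNSIGNED NOT NULL,
    addressId INT UNSIGNED NOT NULL,
    line1 VARCHAR(100) NOT NULL,
    PRIMARY KEY (addressTypeId, addressId)
) ENGINE=InnoDB;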
their structure is identical.
Are their constraints identical as well?¹
If yes, merge the addresses.
If no, keep them separate.
Constraints are as much part of the table as are its fields.
Is keeping them separate, as they are now, better for performance, since requests for the different types of addresses (delivery and invoice) go to two tables rather than one, which might otherwise mean delays when requesting address data?
Do you query both kinds of addresses in the same way?
If yes, it shouldn't matter either way (assuming you indexed correctly).
If not, then different tables enable you to index or cluster your data differently.
Related posts:
Data modeling for Same tables with same columns
Two tables with same columns or one table with additional column?
¹ For example, are both delivery and invoice supposed to be able to reference (through foreign keys) the same address? Are PKs of addresses supposed to be unique for all addresses or just for addresses of a particular type? Are there any CHECKs that exist for one address type and not for the other? Etc, etc...
This is my first question on Stack Overflow. I am a full-stack developer and I work with the following stack: Java - Spring - Angular - MySQL. I am working on a side project and I have a database design question.
I have some information that is common between multiple tables, like:
Document information (can be used initially in FOLDER and CONTRACT tables).
Type information(tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).
To avoid repetition and to avoid coupling the core tables with "Technical" tables (information that can be used in many tables), I am thinking about merging the "Technical" tables into one functional table. For example, we can have a generic DOCUMENT table with the following columns:
ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA
For example, we can retrieve the information about a document with the following query:
SELECT * FROM DOCUMENT WHERE OFFICE_ID = 'office 1 ID' AND TYPE_DOCUMENT = 'CONTRACT' AND OBJECT_ID = 'contract ID';
We can also use the following index to optimize the query:
CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);
My questions are:
Is this a good design?
Is there a better way of implementing this design?
Should I just use a normal database design? For example, a Folder can have many documents, so I create a folder_document table with folder_id as a foreign key, and do the same for all the tables.
Any suggestions or notes are very welcome, and thank you in advance for the help.
What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.
The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.
So you should minimize denormalization and do it only where it gives you a clear advantage in the queries you need to be optimal.
Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.
Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.
If you can't decide which of your queries deserves priority, or you can't predict whether you will have other new queries in the future, then you should stick with a normalized design.
There's no way anyone on Stack Overflow can know your queries better than you do.
Case 1: status
"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.
Case 2: address
"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.
However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.
One Caveat: When you need to change the address of some entity, be careful if you have de-dupped the addresses. It is probably better to always add a new address and waste the space for any no-longer-needed address.
Discussion
Those two cases (status and address) are perhaps the extremes. For each potentially common column, decide which makes more sense. As Bill points out, you really need to be thinking about the queries in order to get the schema 'right'. You must write the main queries before deciding on indexes other than the PRIMARY KEY. (So, I won't address your question about an index for now.)
Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read:
2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte `ENUM('maybe', 'no', 'yes')`
1-byte `ENUM('not_specified', 'Male', 'Female', 'other')`; this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)
Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.
"many-to-many" requires an extra table for connecting the two tables.
ENUM
Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.
ID
id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.
We have a MySQL database table with hundreds of millions of rows. We run into issues performing any kind of operation on it. For example, adding columns is becoming impossible to do within any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea of what the maintenance window is.
We're not tied to keeping this data in MySQL, but I was wondering if there are strategies for MySQL, or databases in general, for updating schemas of large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view which unions the two until all data can be moved to the new table schema.
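For reference, a rough sketch of that union-view idea, with hypothetical table and column names:

-- Hypothetical sketch only: big_table(id, a, b) stands in for the existing table.
CREATE TABLE big_table_v2 LIKE big_table;
ALTER TABLE big_table_v2 ADD COLUMN new_col INT NULL;

CREATE VIEW big_table_all AS
    SELECT id, a, b, NULL AS new_col FROM big_table
    UNION ALL
    SELECT id, a, b, new_col FROM big_table_v2;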
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that only apply to a small number of rows. They may prove useful in your case.
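A minimal EAV sketch (table and column names are illustrative):

-- One row per entity/attribute pair; "adding a column" becomes inserting rows.
CREATE TABLE entity_attributes (
    entity_id BIGINT UNSIGNED NOT NULL,
    attribute VARCHAR(64) NOT NULL,
    value VARCHAR(255) NULL,               -- note: everything tends to become a string
    PRIMARY KEY (entity_id, attribute)
) ENGINE=InnoDB;

INSERT INTO entity_attributes (entity_id, attribute, value)
VALUES (42, 'new_col', '123');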
In NoSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal with space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, then JSON is not that costly in space -- simply leave out any 'fields' that are missing; they take zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress (in the client) the JSON string and use a BLOB. This gives about 3x shrinkage over uncompressed TEXT.
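A sketch of that hybrid layout, with illustrative names (the JSON is compressed client-side, e.g. with gzip, before being inserted):

-- Common/searchable fields as real columns; the sparse rest as a compressed JSON blob.
CREATE TABLE items (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    created_at DATETIME NOT NULL,
    extra BLOB NULL,                        -- gzip-compressed JSON of the sparse fields
    KEY idx_items_name (name)
) ENGINE=InnoDB;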
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.
Would the following relationships between the tables work out?
There are over 4000 rows for Airline Data, 150k rows for RAW DATA and about 2000 rows for Airports.
I cannot create a primary key for RAW DATA because there are many repeated values.
http://i108.photobucket.com/albums/n32/lurker3345/ACCESSHELP-1.png
The relationships look fine. I assume many things -- for starters, that the data types match where they are linked. The diagram doesn't communicate much, and there could be many reasons why the schema shown is not optimal.
You certainly can create a PK for RAW DATA, and you had better because it is voluminous.
A common approach is to select multiple fields to serve as the key because together they form a unique value. This is called a compound key. It's helpful (even essential) because it naturally ensures the unique combination is not unintentionally duplicated. (In most situations you will want to make sure all key fields are set to not allow a zero-length or null entry.)
There is a simpler approach that serves many situations. Maybe you don't need this kind of data integrity, or you aren't sure yet what would make up a compound key, or you just want to get a provisional PK in place. Merely add an autonumber field and declare that as PK.
Some developers take that easy approach and accomplish data validation outside the table...and some ignore data validation needs, which can result in a disaster.
Once you have the PK declared, making sure the table has indexes on critical fields (in addition to the PK) is important for efficiency.
Really, before all else, do yourself a favor and rename all tables and fields so there are no spaces. While you're at it, rethink every name and aim for the most descriptive and standardized name possible. Access is cruel when it comes to renaming things later on. Avoiding spaces is a practice that will help you greatly further down the road.
The entities to be stored have 25+ properties (table columns). The entities are pretty diverse, meaning that most of the columns are empty. On average, I'd say, fewer than 20% (<5) of the properties have a value in any particular item. So I have a lot of redundant empty columns for most of the table rows. Almost all of the columns are decimal numbers.
Given this scenario, would you suggest serializing the columns instead, or perhaps creating another table named "Property", which would contain all the possible properties, and then yet another table "EntityProperty" which would map a property to an entity using foreign keys? Or would you leave it as it is?
An example scenario where this kind of redundancy might occur could be the following:
We have an imaginary universe with lots of planets. We are creating a space mining game and each planet has 30 different mineral contents. Most of the planets have only 2-3 minerals.
The simplest solution would be to create a single table 'Planets' with 30 columns, one for each mineral. The problem here is that for most rows in the 'Planets' table, 25+ of those columns hold null or zero, so we have a lot of redundant data. Say we have 500k-1M records. I would guess it costs at most a byte to store a null or zero decimal value, so we waste 500,000-1,000,000 bytes, i.e. about one megabyte at most.
The other solution would be to create two additional tables. Instead of storing all the minerals in the 'Planets' table, we take them out and create a table for the minerals called 'Minerals'. This would contain only 30 rows, one for each different mineral type. Then we create a table called 'PlanetMineral' which contains a reference to a planet row and to a mineral row, and additionally a column giving the amount of that mineral the planet has. Apparently, in many database systems this complicates queries since you may have to do several joins. I'm using SQL Server with LINQ to SQL, which scaffolds the foreign key constraint into a class object property accessible through code (i.e. I can simply access a planet's minerals with planet.Minerals), so from this perspective it doesn't add complexity. The redundancy is a small fraction (like 1/15) of the first solution. The reason there is still some overhead is the foreign keys we need to store.
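A sketch of that layout in SQL (names follow the description above; column types are assumptions):

CREATE TABLE Planets (
    PlanetId INT NOT NULL PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);

CREATE TABLE Minerals (
    MineralId INT NOT NULL PRIMARY KEY,
    Name VARCHAR(50) NOT NULL
);

-- One row per (planet, mineral) pair that actually has a non-zero amount
CREATE TABLE PlanetMineral (
    PlanetId INT NOT NULL,
    MineralId INT NOT NULL,
    Amount DECIMAL(18, 4) NOT NULL,
    PRIMARY KEY (PlanetId, MineralId),
    FOREIGN KEY (PlanetId) REFERENCES Planets (PlanetId),
    FOREIGN KEY (MineralId) REFERENCES Minerals (MineralId)
);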
As for the data query efficiency, I really don't know how the costs of the queries would compare between these two solutions.
It depends:
How many entities (rows) are you planning to have?
What kind of queries will you run against that table?
Will there be a lot of new properties in the future?
How are you planning to use the properties?
You seem to be concerned about wasting space with the simple table. Try to calculate whether the space savings of the other approaches are really significant and worthwhile. Disk is (usually) cheap.
If you have a low number of rows, then the single table is probably better (it is easier to implement).
If you plan to run complex queries against the properties (e.g. WHERE property1 < 123), then the simple table is probably easier.
If you are planning to add a lot of new properties in the future, then the Property/EntityProperty approach could be useful.
I'd go with the simple one-table approach because you have a rather small number of rows (<1M), you are probably running your database on server machines rather than some handheld/mobile device (SQL Server), and your database schema is rather rigid.
For numbers, I would personally leave it as is, in one table. Numbers are compressed into a few bytes, and the overhead of having an EntityProperty table would far outweigh that. Serializing is an option, but it means you cannot use SQL to search or compute on the properties; you have to fetch the data, deserialize it, and then compute.
How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because it's what you always see; consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
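For instance (a sketch with hypothetical table and column names):

CREATE TABLE customers (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE users (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    customer_id INT UNSIGNED NOT NULL,
    username VARCHAR(50) NOT NULL,
    password_hash CHAR(60) NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers (id),  -- explicit relationship
    KEY idx_users_username (username)                     -- supports lookups by username
);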
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k - 100k rows, which roughly corresponds to the point when the RDBMS is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is meant in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional data points: the approximate size of the row and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000 characters, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.