Would the following relationships between the tables work out?
There are over 4000 rows for Airline Data, 150k rows for RAW DATA and
about 2000 rows for Airports.
I cannot create a primary key for RAW DATA because there are many repeated values.
http://i108.photobucket.com/albums/n32/lurker3345/ACCESSHELP-1.png
The relationships look fine. I assume many things -- for starters, that the data types match where they are linked. The diagram doesn't communicate much, and there could be many reasons why the schema shown is not optimal.
You certainly can create a PK for RAW DATA, and you had better because it is voluminous.
A common approach is to select multiple fields to serve as the key because together they obtain a unique value. This is called a compound key. It's helpful (even essential) because it naturally ensures the unique combination is not unintentially duplicated. (In most situations you will want to make sure all key fields are set to not allow a zero-length or null entry.)
There is a simpler approach that serves many situations. Maybe you don't need this kind of data integrity, or you aren't sure yet what would make up a compound key, or you just want to get a provisional PK in place. Merely add an autonumber field and declare that as PK.
Some developers take that easy approach and accomplish data validation outside the table...and some ignore data validation needs, which can result in a disaster.
Once you have the PK declared, making sure the table has indexes on critical fields (in addition to the PK) is important for efficiency.
Really, before all else, do yourself a favor and rename all tables and fields so there are no spaces. While at it, rethink every name and try for most descriptive and standardized name possible. Access is cruel when it comes to renaming things later on. Avoiding spaces is a practice that will help you greatly further down the road.
Related
this is my first question on stack-overflow, i am a full-stack developer i work with the following stack: Java - spring - angular - MySQL. i am working on a side project and i have a database design questions.
i have some information that are common between multiple tables like:
Document information (can be used initially in FOLDER and CONTRACT
tables).
Type information(tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).
To avoid repetition and coupling the core tables with "Technical" tables (information that can be used in many tables). i am thinking about merging the "Technical" tables into one functional table. for example we can have a generic DOCUMENT table with the following columns:
ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA
for example we can retrieve the information about a document with the following query:
SELECT * FROM DOCUMENT WHERE OFFICE_ID = "office 1 ID" AND TYPE_DOCUMENT = "CONTRACT" AND OBJECT_ID= "contract ID";
we can also use the following index to optimize the query:
CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);
My questions are:
is this a good design.
is there a better way of implementing this design.
should i just use normal database design, for example a Folder can
have many documents, so i create a folder_document table with the
folder_id as a foreign key. and do the same for all the tables.
Any suggestions or notes are very welcomed and thank you in advance for the help.
What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.
The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.
So you would like to minimize the denormalization and do it only when it gives you good advantages in queries you need to be optimal.
Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.
Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.
If you can't decide which of your queries deserves priority, or you can't predict whether you will have other new queries in the future, then you should stick with a normalized design.
There's no way anyone on Stack Overflow can know your queries better than you do.
Case 1: status
"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.
Case 2: address
"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.
However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.
One Caveat: When you need to change the address of some entity, be careful if you have de-dupped the addresses. It is probably better to always add a new address and waste the space for any no-longer-needed address.
Discussion
Those two cases (status and access) are perhaps the extremes. For each potentially common column, decide which makes more sense. As Bill points, out, you really need to be thinking about the queries in order to get the schema 'right'. You must write the main queries before deciding on indexes other than the PRIMARY KEY. (So, I won't now address your question about an Index.)
Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read:
2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte `ENUM('maybe', 'no', 'yes')
1-byte `ENUM('not_specified', 'Male', 'Female', 'other'); this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)
Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.
"many-to-many" requires an extra table for connecting the two tables.
ENUM
Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.
ID
id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.
I am aware of benefits of using integers (amount of space, performance, indexes) as primary keys as opposite to strings.
Considering situation below...
I have a lookup table called ap_habitat (habitat values are also unique)
id habitat
1 Forest 1
2 Forest 2
Referenced table (fauna)
Especie habitat
X 1
Y 1
Referenced table is not very human readable (I know end users should not care about that, but for me would be useful to directly see in fauna table the NAME of the habitat).
To get a list of fauna and its habitat name I have to do a join...
select fauna.habitat, fauna.especie, AP_h.habitat from fauna INNER JOIN ap_habitat AS AP_h on AP_h.id=1
I could create a view, but if I have to create a view for each table referencing a foreign key...
Just wanna check what more experienced people recommend me.
Databases and, in general, computers are not designed to make your life more simple. They are designed to handle more data than a human mind can ever hope to remember in less time than it takes a human to blink. ;-)
Readability (especially in ideas conceived the before-Apple age) is not an issue at all.
On top of that: If you enjoy strange problems, data mapping impedance and spending endless nights writing workarounds for problems that using real-world names as primary keys get you for free, then be our guest. But please, don't ask for our help. We already know all the problems that you'll run into and it will be very hard for us to restrain our spite.
So: Never, ever use anything but an ID (UUID or long sequence) for a primary key. There are no (good) reasons to do it and if you found one, then you simply don't see the whole picture.
Yes, it makes a couple of things harder (like understanding what your data actually means). But as I said above, computers are meant to solve "lots of data" and "too slow" and nothing else.
Create a view or write a small helper application that can run your most important queries at the click of a button.
That said, I had some success with an application which runs a query and then displays a list of check boxes where I can pull in the foreign key relations to the data that the query returns (i.e. one checkbox per FK).
You ask about number or string as primary key. But based on your example if you use a string it wouldn't be a primary key at all, because you would no longer have a lookup table for it to be the primary key of. Perhaps you would still have the table for reasons not shown, like populating a drop down or storing extended descriptions beyond just the name.
Doing needless joins is not a good thing for performance. And having needless tables might be bad for storage size as well, depending on the length of the strings and the ratio of the sizes of the two tables.
You could also consider enumerated types, in which the data is stored as numbers (more or less) but the database translates them to and from strings automatically.
I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created here and each hash would go into here as well as the original hash table, but in the Jobs table it'd also be stored against the job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!
Adding the new job table is the way to go. It's the normative practice, for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are really defining a relationship between job and the table that has hash_value as the primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
The problem I'm trying to solve is that with this design, a hash can
only belong to one job - in reality a hash can occur in many jobs, and
I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called
'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
Duplicate data and redundant data are technical terms in relational modeling. Technical term means they have meanings that you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)
If i read your question correctly, your design is fundamentally flawed, because of these two facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.
I have been working on my database and the thought occurred to me that maybe it would be better to combine two of my tables to better organise the data and perhaps get performance benefits (or not?).
I have two tables that contain addresses, delivery and the other invoice, their structure is identical. One table contains invoice addresses and the other contains delivery.
What would be the implications of merging these together into one table simply called "addresses", and create a new column called addressTypeId? This new column references a new table that contains address types like delivery, invoice, home etc.
Is having them how they are now, separate, better for performance as requests for the different types of addresses (delivery and invoice) make use of two tables as opposed to one table which might mean delays when requesting address data?
By the way I am using INNODB.
If you are missing the appropriate indexes, then the look up performance will drop by a factor of two (if you are merging two equally sized tables). However, if you are missing indexes, you likely don't care about the performance.
Lookup using a hashed index is constant-time. Lookup using a tree index is logarithmic, so the effect is small. Writes to a tree index are logarithmic as well and writes to a hash map are amortized constant.
don't suffer from premature optimization!!!
A good design is more important than peak performance. Address lookup is likely not your bottleneck. A bad code resulting from a bad database design far outweighs any benefits. If you make two tables, you are going to duplicate code. Code duplication is a maintainance nightmare.
Merge the tables. You will be thankful when you need to extend your application in the near future. You could want to make more address types. You could want to add common functionality to the addresses (formatting). Your customers will not notice the extra milisecond from traversing one more level of a binary tree. They will notice you have a hard time adding an extra feature and they will notice inconsistencies arising from code duplication.
You might even gain performance by merging the tables. While you might need to traverse an extra node in a tree, the tree might be more likely to be cached in memory and not need disk access. Disk access is expensive. You might reduce disk access by merging.
As #BenP.P.Tung already said, you don't need an extra table for an enumeration. Use an enumeration type.
If you just need to distinguish the address difference. I suggest what you need is a ENUM column in this merged table. If it is exist, you can add a new column like following,
alter table add addressTypes ENUM('delivery','invoice','home') DEFAULT NULL;
Or DEFAULT invoice something you think should be default when you can not get the required information.
Don't need to put all enum values at a time. Just what you needed now, and add more value in the future as following.
alter table change addressTypes addressTypes ENUM('delivery','invoice','home','office') DEFAULT NULL;
One table will work fine. If there is a performance concern, then add the address type column to the primary index at the start of the index. This will avoid any performance issues until you have a very large number of addresses.
their structure is identical.
Are their constraints identical as well?1
If yes, merge the addresses.
If no, keep them separate.
Constraints are as much part of the table as are its fields.
Is having them how they are now, separate, better for performance as requests for the different types of addresses (delivery and invoice) make use of two tables as opposed to one table which might mean delays when requesting address data?
Do you query both kinds of addresses in the same way?
If yes, it shouldn't matter either way (assuming you indexed correctly).
If not, then different tables enable you to index or cluster your data differently.
Related posts:
Data modeling for Same tables with same columns
Two tables with same columns or one table with additional column?
1 For example, are both delivery and invoice supposed to be able to reference (through foreign keys) the same address? Are PKs of addresses supposed to be unique for all addresses or just for addresses of particular type? Are there any CHECKs that exist for one address type and not for the other? Etc, etc...
I have some mysql tables that have auto incrementing id's that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help to have related tables, where you reference the content on one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates a row from another row (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is ok to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response makes it, in of itself, a downvotable response.
Please do educate me, and the OP, by leaving comments pro or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. A promised I returned and added to my response to provide a list of reasons why it may be detrimental to [again, systematically] impose a auto-incremented PK.
My original response:
You bet! you can remove these!
Before you do anything to the database make sure you have a backup, in particular is the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically
ALTER TABLE myTable DROP COLUMN id
(you also need to remove the PK constraint before removing the id, if the table has such a constraint)
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an autoincremented ID key, regardless of the relative little extra storage requirement these keys add.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or, the application manages the key generation
The key supplied "natively" in the data doesn't necessarily neeeds to be a single column key, it can be a composite key, although in these cases one may wish to study the situation more closely, particularly is the overal key is a bit long.
Here are some of the drawbacks of using an auto-incremeted primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked
i.e. the server may allow record insertions of updates which create a duplicated [native] key (eventhough the artificial, autoincremented primary key hides this reality)
When relying on the auto-incremented PK for the support of joins between tables, when part of the [native] key values have to be updated...
...we either create the need of deleting the record in full and and re-insert it with the news values,
...or the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without an native or application-supplied primary key, so so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can made more difficult depending on the DBMS (need to declare the underlying column as plain integer, prior to copy, then need start again the autoincrement...)
For narrow tables, i.e. tables with a few columns only, the relative cost of the auto-incremented PK can be significant, and impact performance in a non negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
Yes, if you can figure out another primary key.
There is obviously a flaw of your table design. For example, you had a table like
relation_id(PK), parent_id, child_id .
It is known that the combination of parent_id and child_id is unique, then you can assign the primary key to be parent_id + child_id, and then drop the column relation_id.
There should may endlessly other possible cases, but just bear in mind that primary key is helping you to locate data quickly, as well as helping you have your design making sense.