DB Design - any way to avoid duplicating columns here? - mysql

I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created there, and each hash would go into the Jobs table as well as the original hash table, so that in the Jobs table each hash is stored against its job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!

Adding the new job table is the way to go. It's the normative practice for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are defining a relationship between a job and the table that has hash_value as its primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
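A minimal sketch of that, assuming the existing table is named hashes and its primary key hash_value is a fixed-width string (all names and types here are assumptions to adapt):

-- Hypothetical names/types: adjust to the real hash table's definition.
CREATE TABLE jobs (
    job_id     INT UNSIGNED NOT NULL,
    job_name   VARCHAR(100) NOT NULL,
    hash_value CHAR(40)     NOT NULL,  -- must match hashes.hash_value exactly
    PRIMARY KEY (job_id, hash_value),  -- the same hash may appear in many jobs
    FOREIGN KEY (hash_value) REFERENCES hashes (hash_value)
) ENGINE=InnoDB;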

The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
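For example, if a jobs table like the one sketched above already exists, the constraint with cascades could be added like this (illustrative names; the right cascade behavior depends on what should happen to job rows when a hash is updated or deleted):

ALTER TABLE jobs
    ADD CONSTRAINT fk_jobs_hash
    FOREIGN KEY (hash_value) REFERENCES hashes (hash_value)
    ON UPDATE CASCADE
    ON DELETE CASCADE;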
Duplicate data and redundant data are technical terms in relational modeling; that is, they have meanings you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)

If I read your question correctly, your design is fundamentally flawed, because of these facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.
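A sketch of that layout, reading "job_id as the primary key" as a composite key led by job_id so one job can still hold many hashes (the table name, column names, and hash width are assumptions):

CREATE TABLE job_hashes (
    job_id INT UNSIGNED NOT NULL,
    hash   BINARY(20)   NOT NULL,  -- e.g. a raw SHA-1; adjust to the real width
    PRIMARY KEY (job_id, hash),
    KEY idx_hash (hash)            -- non-unique: the same hash may recur
) ENGINE=InnoDB;

-- Finding the job(s) for a given hash is then straightforward:
SELECT job_id FROM job_hashes
WHERE hash = UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709');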

Related

Index every column to add foreign keys

I am currently learning about foreign keys and trying to add them as much as I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL creates an index automatically for every foreign key, so essentially I'll have an index on every column. Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
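A rough sketch of that staging flow, with hypothetical table and column names, assuming a UNIQUE index on each lookup table's value column and no concurrent writers during the move (or a pair of staging tables swapped between batches):

-- 1) Cheap, index-light inserts land in a staging table
INSERT INTO clicks_staging (ts, url, user_agent)
VALUES (NOW(), 'http://example.com', 'Mozilla/5.0');

-- 2) Periodically, add any new lookup values in bulk (repeat per lookup table)...
INSERT IGNORE INTO links (url) SELECT DISTINCT url FROM clicks_staging;
INSERT IGNORE INTO user_agents (user_agent) SELECT DISTINCT user_agent FROM clicks_staging;

-- 3) ...then move the rows into the real table, resolving the _id's with joins
INSERT INTO clicks (timestamp, link_id, user_agent_id)
SELECT s.ts, l.id, ua.id
FROM clicks_staging s
JOIN links l        ON l.url = s.url
JOIN user_agents ua ON ua.user_agent = s.user_agent;
TRUNCATE TABLE clicks_staging;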
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider whether all data items (fields) in your clicks table need a lookup table. For example, the ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table; you do not really need a lookup table, since IP addresses span a wide range of values and are close to unique per row, so the lookup table saves you very little.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
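As a sketch of the IP example above: MySQL's built-in INET_ATON()/INET_NTOA() functions pack an IPv4 address into a plain integer column, so no lookup table or foreign key is needed (IPv6 would need VARBINARY(16) with INET6_ATON() instead; the table here is illustrative):

CREATE TABLE clicks_example (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip INT UNSIGNED NOT NULL            -- IPv4 packed into 4 bytes
);
INSERT INTO clicks_example (ip) VALUES (INET_ATON('192.0.2.10'));
SELECT INET_NTOA(ip) FROM clicks_example;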
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique) and then tweak from there.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com
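One quick way to gauge the selectivity point above is to compare distinct values against total rows (the table and column names here are taken from the question):

SELECT COUNT(DISTINCT link_id) AS distinct_vals,
       COUNT(*)                AS total_rows
FROM clicks;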

Would this style of relating of tables work?

Would the following relationships between the tables work out?
There are over 4000 rows for Airline Data, 150k rows for RAW DATA and about 2000 rows for Airports.
I cannot create a primary key for RAW DATA because there are many repeated values.
http://i108.photobucket.com/albums/n32/lurker3345/ACCESSHELP-1.png
The relationships look fine. I assume many things -- for starters, that the data types match where they are linked. The diagram doesn't communicate much, and there could be many reasons why the schema shown is not optimal.
You certainly can create a PK for RAW DATA, and you had better because it is voluminous.
A common approach is to select multiple fields that together form a unique value and have them serve as the key. This is called a compound key. It's helpful (even essential) because it naturally ensures the unique combination is not unintentionally duplicated. (In most situations you will want to make sure all key fields are set to not allow a zero-length or null entry.)
There is a simpler approach that serves many situations. Maybe you don't need this kind of data integrity, or you aren't sure yet what would make up a compound key, or you just want to get a provisional PK in place. Merely add an autonumber field and declare that as PK.
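Both options, sketched in SQL (MySQL's AUTO_INCREMENT standing in for Access's autonumber; the field names are made up, since the real columns aren't shown):

-- (a) compound key: fields that are only unique together
CREATE TABLE raw_data (
    airline_code VARCHAR(8) NOT NULL,
    airport_code VARCHAR(8) NOT NULL,
    flight_date  DATE       NOT NULL,
    PRIMARY KEY (airline_code, airport_code, flight_date)
);

-- (b) provisional surrogate key: an autonumber field declared as PK
CREATE TABLE raw_data_alt (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    airline_code VARCHAR(8),
    airport_code VARCHAR(8),
    flight_date  DATE
);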
Some developers take that easy approach and accomplish data validation outside the table...and some ignore data validation needs, which can result in a disaster.
Once you have the PK declared, making sure the table has indexes on critical fields (in addition to the PK) is important for efficiency.
Really, before all else, do yourself a favor and rename all tables and fields so there are no spaces. While you're at it, rethink every name and aim for the most descriptive, standardized name possible. Access is cruel when it comes to renaming things later on. Avoiding spaces is a practice that will help you greatly further down the road.

Entire Table Stored in Indexes

I have a question that involves database design. For an application that I am building, a certain set of unique identifiers needs to be related to a variable amount of data. The solution that I built involves two tables.
The first table has an auto incrementing primary key ID, along with two columns that are both in a unique index (which work to identify the specific set of data). The second table then references the primary key, and stores data along with this key.
Using this technique, I am able to link the two identifiers that are contained in the unique index in the first table with a variable amount of rows in the second table.
I know that this will work, but my question involves the viability of this structure. Is it poor database design to have the entire first table contained in indexes? Can anyone think of any better solution that does not involve duplicating the identifiers used in the first table?
I am using MySQL along with InnoDB, if it is pertinent to the question.
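The structure described might look like this (a sketch with placeholder names; InnoDB per the question):

CREATE TABLE identifiers (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    key_a VARCHAR(64)  NOT NULL,
    key_b VARCHAR(64)  NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_pair (key_a, key_b)   -- the two identifying columns
) ENGINE=InnoDB;

CREATE TABLE identifier_data (
    identifier_id INT UNSIGNED NOT NULL,
    payload       TEXT NOT NULL,
    KEY idx_identifier (identifier_id),
    FOREIGN KEY (identifier_id) REFERENCES identifiers (id)
) ENGINE=InnoDB;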
Is it poor database design to have the entire first table contained in indexes?
Not in your case. The real question should probably be, "What are the candidate keys in the second table?"
In your case, you can think of your "first" table as an enumeration of the valid values implied in a hypothetical CHECK() constraint.
Have you ever heard of domain-key normal form (DKNF)? The more familiar 3NF, BCNF, 4NF, and 5NF are special cases of DKNF.

Is it better to have one large table or should I use several?

I'm building a new application that has a number of data objects and each one needs "history" or notes. In the past I have just created one database table called notes and had a number of foreign keys attached to the different objects. This time I would like others thoughts. Is it good practice/efficient to use one table with ever increasing auto_inc IDs or should I maintain different [object]_notes type tables?
N.B. The Notes object itself would always be the same, subject, text, date etc.
I'd use only 1 table. I assume we're not talking gazillions of history notes?
If not, then 1 table is just fine
I think the question you are asking is whether an auto-incrementing ID is as good a primary key as a composite natural key, i.e. a key composed of 2 entities.
Unless you have a good reason to do otherwise, I would stick with the autoincrement Primary Key; it has a unique index and is thus optimized for read lookups. You can still have an index on the composite keys. Some actually prefer the composite-key route, as it can be argued that it makes the relationships clearer and cleaner by not having the extra column on each table, but for small applications and datasets I don't worry about that and just use the autoincrement option.
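For the single-table option, a sketch (the object linkage is shown as a hypothetical type/id pair, since the question doesn't pin down how the notes attach to each object; the subject/text/date fields come from the question):

CREATE TABLE notes (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    object_type VARCHAR(32)  NOT NULL,  -- e.g. 'order', 'customer'
    object_id   INT UNSIGNED NOT NULL,
    subject     VARCHAR(255) NOT NULL,
    text        TEXT,
    created     DATETIME     NOT NULL,
    KEY idx_object (object_type, object_id)  -- composite lookup index
) ENGINE=InnoDB;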

Mysql auto increment primary key id's

I have some mysql tables that have auto incrementing id's that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help when you have related tables, where you reference the content of one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates one row from another (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is OK to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response makes it, in and of itself, a downvotable response.
Please do educate me, and the OP, by leaving comments pro or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. As promised, I returned and added to my response a list of reasons why it may be detrimental to [again, systematically] impose an auto-incremented PK.
My original response:
You bet! You can remove these!
Before you do anything to the database make sure you have a backup, in particular if the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically
ALTER TABLE myTable DROP COLUMN id;
(If the table declares id as an AUTO_INCREMENT PRIMARY KEY, you may need to drop the auto-increment attribute and the key constraint first: ALTER TABLE myTable MODIFY id INT NOT NULL, DROP PRIMARY KEY;)
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an autoincremented ID key, regardless of the relatively little extra storage these keys require.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or the application manages the key generation
The key supplied "natively" in the data doesn't necessarily need to be a single-column key; it can be a composite key, although in these cases one may wish to study the situation more closely, particularly if the overall key is a bit long.
Here are some of the drawbacks of using an auto-incremented primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked,
i.e. the server may allow record insertions or updates which create a duplicated [native] key (even though the artificial, autoincremented primary key hides this reality)
When relying on the auto-incremented PK to support joins between tables, when part of the [native] key values have to be updated...
...we either create the need to delete the record in full and re-insert it with the new values,
...or run the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without a native or application-supplied primary key, but not so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can be made more difficult depending on the DBMS (one may need to declare the underlying column as a plain integer prior to the copy, then restart the autoincrement afterwards...).
For narrow tables, i.e. tables with only a few columns, the relative cost of the auto-incremented PK can be significant, and can impact performance in a non-negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
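A sketch of that two-step dependency in MySQL, with hypothetical tables (LAST_INSERT_ID() returns the id generated by the preceding insert on the same connection):

INSERT INTO orders (customer_name) VALUES ('acme');
INSERT INTO order_lines (order_id, item)
VALUES (LAST_INSERT_ID(), 'widget');   -- must wait for the generated id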
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
Yes, if you can figure out another primary key.
There is obviously a flaw in your table design. For example, say you had a table like:
relation_id (PK), parent_id, child_id
If it is known that the combination of parent_id and child_id is unique, then you can make parent_id + child_id the primary key, and then drop the column relation_id.
There may be endless other possible cases, but just bear in mind that the primary key is there to help you locate data quickly, as well as to keep your design making sense.
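A sketch of that change for the example above (assuming nothing else references relation_id; back up first, as noted earlier):

ALTER TABLE relation
    DROP COLUMN relation_id,            -- removes the old surrogate key
    ADD PRIMARY KEY (parent_id, child_id);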