Database Smell - Improve current design with multiple tables - mysql

I am in the process of creating a second version of my technical wiki site and one of the things I want to improve is the database design. The problem (or so I think) is that to display each document, I need to join upwards of 15 tables. I have a bunch of lookup tables that contain descriptive data associated with each wiki entry such as programmer used, cpu, tags, peripherals, PCB layout software, difficulty level, etc.
Here is an example of the layout:
doc
--------------
id | author_id | doc_type_id .....
1 | 8 | 1
2 | 11 | 3
3 | 13 | 3
_
lookup_programmer
--------------
doc_id | programmer_id
1 | 1
1 | 3
2 | 2
_
programmer
--------------
programmer_id | programmer
1 | USBtinyISP
2 | PICkit
3 | .....
Since some doc IDs may have multiple entries for a single attribute (such as programmer), I have designed the DB to accommodate this. The other 10 attributes have a similar layout to the two programmer tables above. To display a single document article, approximately 20 tables are joined.
I used the Sphinx search engine for finding articles with certain characteristics. Essentially, Sphinx indexes all of the data (it does not store it) and returns the wiki doc IDs of interest based on the filters presented. If I want to find articles that use a certain programmer and then sort by date, MySQL first has to join ALL documents with the 2 programmer tables, then filter, and finally sort what remains by insert time. No index can help with ordering the filtered results (it takes a LONG time with 150k doc IDs), since the ordering is done in a temporary table. As you can imagine, it gets worse very quickly as more parameters need to be filtered.
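To illustrate, the kind of query MySQL ends up running looks roughly like this (created_at stands in for whatever timestamp column the doc table actually has):
select d.*
from doc d
join lookup_programmer lp on lp.doc_id = d.id
join programmer p on p.programmer_id = lp.programmer_id
where p.programmer = 'USBtinyISP'
order by d.created_at desc
limit 20;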
It is because I have to rely on Sphinx to return - say - all wiki entries that use a certain CPU AND programmer that I am led to believe there is a DB smell in my current setup....
edit: Looks like I have implemented an Entity–attribute–value model.

I don't see anything here that suggests you've implemented EAV. Instead, it looks like you've assigned every row in every table an ID number. That's a guaranteed way to increase the number of joins, and it has nothing to do with normalization. (There is no "I've now added an id number" normal form.)
Pick one lookup table. (I'll use "programmer" in my example.) Don't build it like this.
create table programmer (
    programmer_id integer not null,
    programmer varchar(20) not null,
    primary key (programmer_id),
    unique key (programmer)
);
Instead, build it like this.
create table programmer (
    programmer varchar(20) not null,
    primary key (programmer)
);
And in the tables that reference it, consider cascading updates and deletes.
create table lookup_programmer (
    doc_id integer not null,
    programmer varchar(20) not null,
    primary key (doc_id, programmer),
    foreign key (doc_id) references doc (id)
        on delete cascade,
    foreign key (programmer) references programmer (programmer)
        on update cascade on delete cascade
);
What have you gained? You keep all the data integrity that foreign key references give you, your rows are more readable, and you've eliminated a join. Build all your "lookup" tables that way, and you eliminate one join per lookup table. (And unless you have many millions of rows, you're unlikely to see any degradation in performance.)
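For example, listing a document's programmers no longer needs to touch the programmer table at all; against the schema above, this is enough:
select programmer
from lookup_programmer
where doc_id = 1;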


Do holes in indexes somehow impact the database? [closed]

When creating a new index, it seems that people try to avoid holes inside it, and usually use auto-increment. But why? What are the reasons behind that? Maintenance? Security? Or is it simply not beautiful?
Because in my case, I'm supposed to create a book catalog database, and for consistency reasons I would like to make sure that the index of the "book" table matches the fragment of the ISBN corresponding to the publication number of the book's 1st edition at this publisher.
However, some reissues have their own ISBN but won't be counted as book entities in themselves, and so will create holes (reissue data will be merged with the 1st-edition data).
I use MySQL 5.7.23 with phpMyAdmin.
Here is the view I am aiming for from the join of the tables "Book" and "ISBN":
num_book | ISBN
--------------------------------
1 | XXX-X-XXXXXX-1-X
| XXX-X-XXXXXX-5-X
| XXX-X-XXXXXX-9-X
| XXX-X-XXXXXX-14-X
2 | XXX-X-XXXXXX-2-X
3 | XXX-X-XXXXXX-3-X
| XXX-X-XXXXXX-6-X
| XXX-X-XXXXXX-8-X
4 | XXX-X-XXXXXX-4-X
7 | XXX-X-XXXXXX-7-X
| XXX-X-XXXXXX-13-X
10 | XXX-X-XXXXXX-10-X
11 | XXX-X-XXXXXX-11-X
12 | XXX-X-XXXXXX-12-X
15 | XXX-X-XXXXXX-15-X
I intend to use "num_book" with these intentional holes as the primary key of the book table and then join with the ISBN table.
The index numbers will keep increasing but won't necessarily be consecutive (i.e. 1, 2, 3, 4, 7, 10, 11, 12, 15).
Should I worry about that, and why?
Thanks in advance for your attention.
Edit: Oops, as scaisEdge said, I forgot that an index can't start at 0; corrected.
More clarification about the sketch above (legend added): it is not a single table but a view from the join of two tables (book and ISBN), so each "num_book" value is unique but can be bound to several ISBNs.
I think you are referring to a few different concepts all at the same time.
There is a difference between a primary key and an index.
A primary key is a logical concept - it provides the unique, unchanging reference to a row in your table. As other entities refer to the primary key, it may not be null.
An index is a physical concept - it's a way for the database to look up entries in a column. You can specify that an index is unique, and that the indexed columns are not null.
The usual way to physically implement the logical concept of primary key is through a unique, not-null index.
The next question is how to assign the primary key; there are two candidates: natural keys reflect an entity in the problem domain, and surrogate keys are assigned automagically by the database.
In practice, there are very few natural keys (guaranteed unique, not null, unchanging) - I don't know enough about how ISBNs are assigned to have an opinion whether they are suitable. But I've seen problems with social security numbers (they get entered incorrectly into the system), phone numbers (people change their phone number), etc.
Surrogate keys are assigned by the database engine. They are often auto-incrementing integers, but they can also be UUIDs - as long as they are guaranteed unique and not null. Auto-incrementing integers are popular for a couple of reasons.
Many primary keys are implemented using clustered indexes. A clustered index affects the order in which data is stored on disk, so if you have a clustered index, inserting record with ID 1 after you've written record with ID 1000 means re-ordering the data on disk, which is expensive.
Gaps are not really a problem - as long as you're inserting sequentially.
However...this logic is from the 1980s. Back then, a clustered index was notably faster than a non-clustered index. On modern hardware, that's not true in most circumstances.
So, there is no obvious reason why your scheme for assigning primary keys would be a problem as long as you are confident about the way ISBNs are assigned.
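For instance (a sketch, with column names assumed), you can simply assign num_book explicitly instead of using auto_increment; the gaps do no harm as long as the values keep increasing:
create table book (
    num_book integer not null,
    title    varchar(200) not null,
    primary key (num_book)
);
-- gaps are fine as long as inserts arrive in increasing order
insert into book (num_book, title) values
    (1, '...'), (2, '...'), (3, '...'), (4, '...'), (7, '...'), (10, '...');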

MySQL Database Normalization .. one table to connect multiple others?

Let's assume I have a very large database with tons of tables in it.
Some of these tables contain datasets that need to be connected to each other, like:
table: album
table: artist
--> connected by table: album_artist
table: company
table: product
--> connected by table: company_product
The tables album_artist and company_product each contain 3 columns: a primary key, plus albumID/artistID and companyID/productID respectively...
Is it good practice to do something like an "assoc" table, which is made up like this:
---------------------------------------------------------
| id int(11) primary | leftID | assocType | rightID |
|---------------------------------------------------------|
| 1 | 10 | company:product | 4 |
| 2 | 6 | company:product | 5 |
| 3 | 4 | album:artist | 10 |
---------------------------------------------------------
I'm not sure if this is the way to go, or if there is any alternative to creating multiple connection tables?!
No, it is not a good practice. It is a terrible practice, because referential integrity goes out the window. Referential integrity is the guarantee provided by the RDBMS that a foreign key in one row refers to a valid row in another table. In order for the database to be able to enforce referential integrity, each referring column must refer to one and only one referred column of one and only one referred table.
No, no, a thousand times no. Don't overthink your many-to-many relationships. Just keep them simple. There's nothing to gain and a lot to lose by trying to consolidate all your relationships in a single table.
If you have a many to many relationship between, say, guitarist and drummer, then you need a guitarist_drummer table with two columns in it: guitarist_id and drummer_id. That table's primary key should be comprised of both columns. And you should have another index that's made of the two columns in the opposite order. Don't add a third column with an auto-incrementing id to those join tables. That's a waste, and it allows duplicated pairs in those tables, which is generally confusing.
People who took the RDBMS class in school will immediately recognize how these tables work. That's good, because it means you don't have to be the only programmer on this project for the rest of your life.
Pro tip: Use the same column name everywhere. Make your guitarist table contain a primary key called guitarist_id rather than id. It makes your relationship tables easier to understand. And, if you use a reverse engineering tool like Sql Developer that tool will have an easier time with your schema.
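A minimal sketch of that layout (assuming guitarist and drummer tables keyed by guitarist_id and drummer_id):
create table guitarist_drummer (
    guitarist_id integer not null,
    drummer_id   integer not null,
    primary key (guitarist_id, drummer_id),
    key (drummer_id, guitarist_id),          -- the same pair in the opposite order
    foreign key (guitarist_id) references guitarist (guitarist_id),
    foreign key (drummer_id) references drummer (drummer_id)
);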
The answer is that it "depends" on the situation. In your case and most others, no, it does not make sense. It does make sense when you have a many-to-many relationship: the constraints can be enforced by the link table with foreign keys and a unique constraint. Probably the best use case would be if you had numerous tables pointing to a single table. Each table could have a link table with indexes on it. This would be beneficial if one of the tables is large and you need to fetch the linked records separately.

Better PK for future safe data intensive Databases

We are having real technical trouble designing the primary keys for our new data-intensive project.
Please explain which PK design is better for our data-intensive database.
The database is data-intensive and persistent.
At least 3,000 users access it per second.
Please tell us, technically, which type of PK is better for our database; the tables are unlikely to change in the future.
1. INT/BIGINT auto-increment column as PK
2. Composite keys.
3. Unique varchar PK.
I would go for option 1, using a BIGINT auto-increment column as the PK. The reason is simple: each insert writes to the end of the current page, meaning inserting new rows is very fast. If you use a composite key, then inserts must follow the key's order, and unless you are inserting in the order of the composite key you need to split pages to insert. For example, imagine this table:
A | B | C
---+---+---
1 | 1 | 4
1 | 4 | 5
5 | 1 | 2
Where the primary key is a composite key on (A, B, C), suppose I want to insert (2, 2, 2), it would need to be inserted as follows:
A | B | C
---+---+---
1 | 1 | 4
1 | 4 | 5
2 | 2 | 2 <----
5 | 1 | 2
So that the clustered key maintains its order. If the page you are inserting into is already full, then MySQL will need to split the page, moving some of the data to a new page to make room for the new data. These page splits are quite costly, so unless you know you are inserting sequential data, use an auto-increment column as the clustering key; then, unless you mess around with the increments, you should never have to split a page.
You could still add a unique index on the columns that would have been the primary key to maintain integrity. You would still have splits on that index, but since the index is narrower than the clustered index the splits would be less frequent, as more entries fit on a page.
More or less the same argument applies against a unique varchar column, unless you have some kind of process that ensures the varchar is sequential, but generating a sequential varchar is more costly than an autoincrement column, and I can see no immediate advantage.
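A sketch of option 1, with the natural key from the example above kept as a secondary unique index:
create table t (
    id bigint not null auto_increment,
    a  int not null,
    b  int not null,
    c  int not null,
    primary key (id),        -- clustering key: new rows append to the last page
    unique key (a, b, c)     -- natural-key integrity on a narrower secondary index
);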
This is not easy to answer.
To start with, using composite keys as primary keys is the straightforward way. IDs come in handy when the database structure changes.
Say you have products in different sizes sold in different countries. In the listings below, the primary key columns come first.
product (product_no, name, supplier_no, ...)
product_size (product_no, size, ean, measures, ...)
product_country (product_no, country_isocode, translated_name, ...)
product_size_country (product_no, size, country_isocode, vat, ...)
It is very easy to write data, because you are dealing with natural keys, which is what users work with. The DBMS guarantees data consistency.
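For instance, product_size_country references both parent tables with composite foreign keys that share product_no, so the DBMS can enforce that the size and the country belong to the same product (a sketch; column types are assumed):
create table product_size_country (
    product_no      varchar(20) not null,
    size            varchar(10) not null,
    country_isocode char(2)     not null,
    vat             decimal(5,2),
    primary key (product_no, size, country_isocode),
    foreign key (product_no, size)
        references product_size (product_no, size),
    foreign key (product_no, country_isocode)
        references product_country (product_no, country_isocode)
);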
Now the same with technical IDs:
product (product_id, product_no, name, supplier_no, ...)
product_size (product_size_id, size, product_id, ean, measures, ...)
product_country (product_country_id, product_id, country_id, translated_name, ...)
product_size_country (product_size_country_id, product_size_id, country_id, vat, ...)
Getting the IDs is now an additional step when inserting data. And you must still ensure that product_no is unique, so the unique constraint on product_id doesn't replace the constraint on product_no, it adds to it. The same applies to product_size, product_country and product_size_country. Moreover, product_size_country may now link to a product_size and a product_country belonging to different products. The DBMS can no longer guarantee data consistency.
However, natural keys have their weakness when changes to the database structure must be made. Let's say a new company is introduced into the database and product numbers are only unique per company. With the ID-based database you would simply add a company ID to the products table and be done. In the natural-key-based database you would have to add the company to all primary keys. Much more work. (However, how often must such changes be made to a database? For many databases, never.)
What more is there to consider? When the database gets big, you might want to partition tables. With natural keys, you could partition your tables by said company, assuming that you will usually want to select data from one company or the other. With IDs, what would you partition the tables by to enhance access?
Well, both concepts certainly have pros and cons. As to your third option to create a unique varchar, I see no benefit in this over using integer IDs.

Composite primary keys or surrogates when dealing with date/time

I've seen a lot of discussion regarding this. I'm just seeking your suggestions. Basically, what I'm using is PHP and MySQL. I have a users table which goes:
users
------------------------------
uid(pk) | username | password
------------------------------
12 | user1 | hashedpw
------------------------------
and another table which stores updates by the user
updates
--------------------------------------------
uid | date | content
--------------------------------------------
12 | 2011-11-17 08:21:01 | updated profile
12 | 2011-11-17 11:42:01 | created group
--------------------------------------------
The user's profile page will show the 5 most recent updates of a user. The questions are:
For the updates table, would it be possible to set uid and date together as a composite primary key, with uid referencing uid from users
OR would it be better to just create another column in updates which auto-increments and will be used as the primary key (while uid will be FK to uid in users)?
Your idea (under 1.) rests on the assumption that a user can never do two "updates" within one second. That is very poor design. You never know what functions you will implement in the future, but chances are that some day 1 click leads to 2 actions and therefore 2 lines in this table.
I put "updates" in quotes because I see this more as a logging table. And who knows what you may want to log somewhere in the future.
As for unusual primary keys: don't do it, it almost always comes right back in your face and you have to do a lot of work to add a proper autoincremented key afterwards.
It depends on the requirement but a third possibility is that you could make the key (uid, date, content). You could still add a surrogate key as well but in that case you would presumably want to implement both keys - a composite and a surrogate - not just one. Don't make the mistake of thinking you have to make an either/or choice.
Whether it is useful to add the surrogate or not depends on how it's being used - don't add a surrogate unless or until you need it. In any case uid I would assume to be a foreign key referencing the users table.
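One possible combined layout (a sketch; types are assumed): a surrogate key for references plus an index on (uid, date) that serves the "5 most recent updates" query:
create table updates (
    update_id bigint not null auto_increment,
    uid       int not null,
    date      datetime not null,
    content   varchar(255) not null,
    primary key (update_id),
    key (uid, date),
    foreign key (uid) references users (uid)
);
-- fetch the 5 most recent updates for user 12
select date, content
from updates
where uid = 12
order by date desc
limit 5;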

Database Design: need unique rows + relationships

Say I have the following table:
TABLE: product
============================================================
| product_id | name | invoice_price | msrp |
------------------------------------------------------------
| 1 | Widget 1 | 10.00 | 15.00 |
------------------------------------------------------------
| 2 | Widget 2 | 8.00 | 12.00 |
------------------------------------------------------------
In this model, product_id is the PK and is referenced by a number of other tables.
I have a requirement that each row be unique. In the example above, a row is defined to be the name, invoice_price, and msrp columns. (Different tables may have varying definitions of which columns define a "row".)
QUESTIONS:
In the example above, should I make name, invoice_price, and msrp a composite key to guarantee uniqueness of each row?
If the answer to #1 is "yes", this would mean that the current PK, product_id, would not be defined as a key; rather, it would be just an auto-incrementing column. Would that be enough for other tables to use to create relationships to specific rows in the product table?
Note that in some cases, the table may have 10 or more columns that need to be unique. That'll be a lot of columns defining a composite key! Is that a bad thing?
I'm trying to decide whether I should enforce such uniqueness in the database tier or the application tier. I feel I should do this at the database level, but I am concerned that there may be unintended side effects of using a non-key as an FK or of having so many columns define a composite key.
When you have a lot of columns that you need to create a unique key across, create your own "key" using the data from the columns as the source. This would mean creating the key in the application layer, but the database would "enforce" the uniqueness. A simple method would be to use the md5 hash of all the sets of data for the record as your unique key. Then you just have a single piece of data you need to use in relations.
md5 is not guaranteed to be unique, but it may be good enough for your needs.
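A sketch of that approach (the row_hash column name is made up); the application computes the hash, and the database enforces uniqueness:
create table product (
    product_id    int not null auto_increment,
    name          varchar(100)   not null,
    invoice_price decimal(10, 2) not null,
    msrp          decimal(10, 2) not null,
    row_hash      char(32)       not null,
    primary key (product_id),
    unique key (row_hash)
);
insert into product (name, invoice_price, msrp, row_hash)
values ('Widget 1', 10.00, 15.00, md5(concat_ws('|', 'Widget 1', '10.00', '15.00')));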
First off, your intuition to do it in the DB layer is correct if you can do it easily. This means even if your application logic changes, your DB constraints are still valid, lowering the chance of bugs.
But, are you sure you want uniqueness on that? I could easily see the same widget having different prices, say for sale items or what not.
I would recommend against enforcing uniqueness unless there's a real reason to.
You might have something like this (obviously, don't use * in production code):
# get the lowest price for an item that's currently active
select *
from product p
where p.name = 'widget 1'   # a non-primary index on product.name would be advised
  and p.active
order by sale_price asc
limit 1;
You can define composite primary keys and also unique indexes. As long as your requirement is met, defining composite unique keys is not a bad design. Clearly, the more columns you add, the slower updating and searching the keys becomes, but if the business requirement needs it, I don't think that's a negative; databases have well-optimized routines for doing this.
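For example, keeping product_id as the auto-increment primary key (and FK target) while a composite unique key enforces row uniqueness (the index name is arbitrary):
alter table product
    add unique key uq_product_row (name, invoice_price, msrp);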