What is your opinion on using textual identifiers in table columns when approaching the database with normalization and scalability in mind? - mysql

Which table structure is considered better normalized?
For example:
Note: idType tells which kind of thing the comment was made on, and subjectid is the id of the item the comment was made on.
Using idType as the textually named identifier for the subjectid:
commentid ---- subjectid ----- idType
--------------------------------------
1 22 post
2 26 photo
3 84 reply
4 36 post
5 22 status
Compared to this.
commentid ---- postid ----- photoid-----replyid
-----------------------------------------------
1 22 NULL NULL
2 NULL 56 NULL
3 23 NULL NULL
4 NULL NULL 55
5 26 NULL NULL
Looking at both, I don't think the first table can be tied to a foreign key constraint =( (i.e. the comment gets deleted if the post or photo is deleted), whereas in the second one that is possible. How would you approach a similar issue, keeping in mind that the database will need to expand over time and data integrity is also important =).
Thanks

The first is more normalized, if slightly incomplete. There are a couple of approaches you can take; the simplest (and, strictly speaking, the most 'correct') needs two tables, with the obvious FK constraint.
commentid ---- subjectid ----- idType
--------------------------------------
1 22 post
2 26 photo
3 84 reply
4 36 post
5 22 status
idType
------
post
photo
reply
status
If you like, you can use a char(1) or similar to reduce the impact of the varchar on key/index length, or to facilitate use with an ORM if you plan to use one. NULLs are always a bother, and if you start to see them turn up in your design, you will be better off if you can figure out a convenient way to eliminate them.
The second approach is one I prefer when dealing with more than 100 million rows:
commentid ---- subjectid
------------------------
1 22
2 26
3 84
4 36
5 22
postIds ---- subjectid
----------------------
1 22
4 36
photoIds ---- subjectid
-----------------------
2 26
replyIds ---- subjectid
-----------------------
3 84
statusIds ---- subjectid
------------------------
5 22
There is of course also the (slightly denormalized) hybrid approach, which I use extensively with large datasets, as they tend to be dirty. Simply provide the specialization tables for the pre-defined idTypes, but keep an ad hoc idType column on the comment table.
Note that even the hybrid approach only requires 2x the space of the denormalized table, and provides trivial query restriction by idType. The integrity constraint, however, is not straightforward, being an FK constraint on a derived UNION of the type tables. My general approach is to use a trigger on either the hybrid table or an equivalent updatable view to propagate updates to the correct sub-type table.
Both the simple approach and the more complex sub-type table approach work; still, for most purposes KISS applies, so I suspect you should just introduce an ID_TYPES table and the relevant FK, and be done with it.
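A minimal sketch of the simple approach (a type lookup table plus an FK) might look like the following. It uses Python's built-in sqlite3 module purely as a stand-in so it runs without a server; the table names id_types and comments are assumptions, and real MySQL DDL would differ slightly. Note that this FK only validates the idType value; it does not by itself tie subjectid to a row in a post or photo table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Hypothetical names: id_types is the lookup table, comments references it.
conn.executescript("""
CREATE TABLE id_types (
    id_type TEXT PRIMARY KEY
);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL,
    id_type    TEXT NOT NULL REFERENCES id_types(id_type)
);
INSERT INTO id_types VALUES ('post'), ('photo'), ('reply'), ('status');
INSERT INTO comments VALUES (1, 22, 'post'), (2, 26, 'photo');
""")

# A misspelled type is not in id_types, so the FK rejects the row.
try:
    conn.execute("INSERT INTO comments VALUES (3, 84, 'vidoe')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

comment_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

Adding a new kind of subject later is then a one-row insert into id_types rather than a schema change.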

Related

Is it okay to have non sequential ids as primary keys for a table in your database?

I don't know enough about databases to find the right words to ask this question, so let me give an example to explain what I'm trying to do: Suppose I want the primary key for a table to be an ID I grab from an API, but the majority of those API requests result in 404 errors. As a result, my table would look like this:
I also don't know how to format a table-like structure on Stack Overflow, so this is going to be a rough visual:
API_ID_PK | name
------------------
1 | Billy
5 | Timmy
23 | Richard
54 | Jobert
104 | Broccoli
Is it okay for the IDs not to increase sequentially by 1? Or should I do this:
ID PK | API_ID | NAME
----------------------------------------
1 | 1 | Billy
2 | 5 | Timmy
3 | 23 | Richard
4 | 54 | Jobert
5 | 104 | Broccoli
Would the second table be more efficient for indexing reasons? Or is the first table perfectly fine? Thanks!
No, there won't be any effect on efficiency if you have non-consecutive IDs. In fact, MySQL (and other databases) lets you set the auto_increment_increment variable so that the ID increments by more than 1. This is commonly used in multi-master setups.
It's fine to have IDs that are not sequential. I regularly use GUIDs for IDs when dealing with enterprise software where multiple businesses could share the same object, and they're never sequential.
The one thing to watch out for is duplicate values. What's determining the ID value you're storing?
If you have a clustered index (SQL Server) on an ID column and insert IDs with random values (like GUIDs), this can have a negative effect, as the physical order of the clustered index corresponds to the logical order. This can lead to a lot of index reorganisations. See: Improving performance of cluster index GUID primary key.
However, ordered but non consecutive values (values not separated by 1) are not a problem for clustered indexes.
For non-clustered indexes the order doesn't matter. It is okay to insert random values for primary keys as long as they are unique.

Most efficient, scalable mysql database design

I have an interesting question about database design:
I came up with the following design:
first table:
**Survivors:**
Survivor_Id | Name | Strength | Energy
second table:
**Skills:**
Skill_Id | Name
third table:
**Survivor_skills:**
Survivor_Id | Skill_Id | Level
In the first table, Survivors, there will be many records, and it will grow over time.
In the second table there will be just a few skills which survivors can learn (for example: recon (higher view range), sniper (better accuracy), ...). These skills aren't like strength or energy, which all survivors have.
The third table is the most interesting; that is where survivors and skills join together. Everything will work just fine, but I am worried about data duplication.
For example: the survivor with id 1 will have 5 skills, so the third table would look like this:
// survivor_id | skill_id | level
1 | 1 | 2
1 | 2 | 3
1 | 3 | 1
1 | 4 | 5
1 | 5 | 1
First record: survivor with id 1 has skill with id 1 on level 2
Second record ...
Is this the proper approach, or should I use something different?
Looks good to me. If you are worried about data duplication:
1) your server-side code should be geared to not letting this happen
2) you could check before inserting whether the row already exists
3) you could use MySQL's REPLACE INTO - provided a PRIMARY KEY or UNIQUE index is defined, this replaces a row with the same key, or inserts a new one otherwise (http://dev.mysql.com/doc/refman/5.0/en/replace.html)
4) set a unique index on the columns where you want only unique combinations, e.g. (Survivor_Id, Skill_Id) in Survivor_skills
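Points 3) and 4) can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module in place of MySQL (SQLite happens to accept REPLACE INTO as well); the lower-case table and column names are assumptions based on the question's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survivor_skills (
        survivor_id INTEGER NOT NULL,
        skill_id    INTEGER NOT NULL,
        level       INTEGER NOT NULL,
        PRIMARY KEY (survivor_id, skill_id)  -- one row per survivor/skill pair
    )
""")
conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 2)")

# A second row for the same (survivor, skill) pair violates the key.
try:
    conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# REPLACE INTO removes the conflicting row and inserts the new one.
conn.execute("REPLACE INTO survivor_skills VALUES (1, 1, 3)")
rows = conn.execute("SELECT * FROM survivor_skills").fetchall()
```

With the key in place, the database itself guarantees uniqueness, so the application-side checks in 1) and 2) become a convenience rather than the only line of defence.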
I concur with the others - this is the proper approach.
However, there is one aspect which hasn't been discussed: the order of columns in the composite key {Survivor_Id, Skill_Id}, which will be governed by the kinds of queries you need to run...
If you need to find skills of the given survivor, the order needs to be: {Survivor_Id, Skill_Id}.
If you need to find survivors with the given skill, the order needs to be: {Skill_Id, Survivor_Id}.
If you need both, you'll need both the key (and the implied index) on {Survivor_Id, Skill_Id} and an index on {Skill_Id, Survivor_Id} [1]. Since InnoDB tables are clustered, accessing Level through that secondary index requires a double lookup - to avoid that, consider using a covering index {Skill_Id, Survivor_Id, Level} instead.
[1] Or vice versa.
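As a rough illustration of the "need both" case, the sketch below creates the key in one column order and a secondary index in the other, with Level appended so the index covers the second query. It runs on Python's built-in sqlite3 module, so the query plan it prints comes from SQLite's planner, not InnoDB's; the index name idx_skill_survivor_level is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE survivor_skills (
    survivor_id INTEGER NOT NULL,
    skill_id    INTEGER NOT NULL,
    level       INTEGER NOT NULL,
    PRIMARY KEY (survivor_id, skill_id)      -- serves "skills of a survivor"
);
CREATE INDEX idx_skill_survivor_level        -- serves "survivors with a skill"
    ON survivor_skills (skill_id, survivor_id, level);
INSERT INTO survivor_skills VALUES
    (1, 1, 2), (1, 2, 3), (2, 1, 4), (3, 2, 1);
""")

skills_of_survivor_1 = conn.execute(
    "SELECT skill_id, level FROM survivor_skills "
    "WHERE survivor_id = ? ORDER BY skill_id", (1,)
).fetchall()

survivors_with_skill_1 = conn.execute(
    "SELECT survivor_id, level FROM survivor_skills "
    "WHERE skill_id = ? ORDER BY survivor_id", (1,)
).fetchall()

# Inspect how the second query is resolved (output format varies by version).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT survivor_id, level "
    "FROM survivor_skills WHERE skill_id = 1"
).fetchall()
print(plan)
```

In MySQL the equivalent check would be an EXPLAIN on the second query, looking at which index is chosen.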

Do I need a Primary Key If I'm using 1 to Many Relationship?

I have a table called branch
It looks something like.
+----------------+--------------+
| branch_id | branch_name |
+----------------+--------------+
| 1 | TestBranch1 |
| 2 | TestBranch2 |
+----------------+--------------+
I've set the branch_id as primary key.
Now my question is related to the next table called item
It looks like this.
+----------------+-----------+---------------------------+
| branch_id | item_id | item_name |
+----------------+-----------+---------------------------+
| 1 | 1 | Apple |
| 1 | 2 | Ball |
| 2 | 1 | Totally Difference Apple |
| 2 | 2 | Apple Apple 2 |
+----------------+-----------+---------------------------+
I'd like to know if I need to create a primary key for my item table?
UPDATE
They do not share the same items. Sorry for the confusion.. A branch can create a product that doesn't exist in the other branch. They are like two stores sharing the same database.
UPDATE
Sorry for the incomplete information.
These tables are actually from two local databases...
I'm trying to create a database that can exist on its own but would still have no problem when merged with another. So the system would just append all the item data from another branch without mixing them up. The branches don't take the item_id of the other branches into consideration when generating a unique id for their items. All the databases, however, may share the same branch table as a reference.
Thank you guys in advance.
I'd like to know if I need to create a primary key for my item table?
You always [1] need a key [2], whether the table is involved in a relationship [3] or not. The only question is what kind of key?
Here are your options in this case:
1) Make {item_id} alone a key. This makes the relationship "non-identifying" and item a "strong" entity...
   - Which produces a slimmer key (compared to the second option), therefore any child tables that may reference it are slimmer.
   - Any ON UPDATE CASCADE actions are cut off at the level of the item and not propagated to children.
   - May play better with ORMs.
2) Make a composite [4] key on {branch_id, item_no}. This makes the relationship "identifying" and item a "weak" entity...
   - Which makes item itself slimmer (one less index).
   - Which may be very useful for clustering.
   - May help you avoid a JOIN in some cases (if there are child tables, branch_id is propagated to them).
   - May be necessary for correctly modelling "diamond-shaped" dependencies.
So pick your poison ;)
Of course, branch_id is a foreign key (but not key) in both cases.
And orthogonal to all that, if item_name has to be unique per-branch (as opposed to per whole table), you need a composite key on {branch_id, item_name} as well.
[1] From the logical perspective, you always need a key, otherwise your table would be a multiset, therefore not a relation (which is a set), therefore your database would no longer be "relational". From the physical perspective, there may be some special cases for breaking this rule, but they are rare.
[2] Whether it's primary or not is immaterial from the logical standpoint, although it may be important if the DBMS ascribes a special meaning to it, as is the case with InnoDB, which uses the primary key as the clustering key.
[3] Please make a distinction between "relation" and "relationship".
[4] Aka "compound".
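The second option (a composite key) fits the "append items from another branch" scenario in the question, and could be sketched like this. It is an illustration on Python's built-in sqlite3 module rather than MySQL; the column is called item_id here to match the question's sample data, and the UNIQUE (branch_id, item_name) constraint is only needed if names must be unique per branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE branch (
    branch_id   INTEGER PRIMARY KEY,
    branch_name TEXT NOT NULL
);
CREATE TABLE item (
    branch_id INTEGER NOT NULL REFERENCES branch(branch_id),
    item_id   INTEGER NOT NULL,
    item_name TEXT NOT NULL,
    PRIMARY KEY (branch_id, item_id),   -- item_id only unique within a branch
    UNIQUE (branch_id, item_name)       -- optional, per-branch name uniqueness
);
INSERT INTO branch VALUES (1, 'TestBranch1'), (2, 'TestBranch2');
""")

# Each branch numbers its items independently, starting from 1.
branch1_items = [(1, 1, "Apple"), (1, 2, "Ball")]
branch2_items = [(2, 1, "Totally Difference Apple"), (2, 2, "Apple Apple 2")]
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 branch1_items + branch2_items)
item_count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]

# Reusing an item_id inside the same branch is still rejected.
try:
    conn.execute("INSERT INTO item VALUES (1, 1, 'Another Apple')")
    clash_rejected = False
except sqlite3.IntegrityError:
    clash_rejected = True
```

Because branch_id is part of the key, rows from both local databases can be loaded side by side without renumbering.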
According to your example data you are using n to m relations and not 1 to m. It should be like this
item table
----------
item_id | item_name
1 | Apple
2 | Ball
branch_item table
-----------------
item_id | branch_id
1 | 1
1 | 2
2 | 1
2 | 2
And your branch_item table should have a compound unique key containing branch_id and item_id to make sure no duplicate entries can be added.
Yes you do. The Primary key is what allows the many to one relationship to exist.
This requirement is already catered for by the branch_id column.
The item_id column is not required for the one-to-many relationship in your example.

Using row values from one table to determine fields to select from another table in MySQL

I have multiple tables in a MySQL database, and I would like to be able to use row values from one table as the columns to select from another table. For instance, suppose that my tables description and information were as follows:
-------------
| description |
-------------
id colalias privacy
-- -------- -------
1 fname 3
2 lname 3
3 salary 2
4 empid 1
-------------
| information |
-------------
id fname lname salary empid
-- ----- ----- ------ -----
1 Bob White 50000 12345
2 Tom Black 75000 54321
3 Sue Green 82000 67890
4 Ann Brown 63000 09876
Suppose that I want to return a table that pulls all data from information where the description.privacy is equal to 3. So I'd like to have an output of
fname lname
----- -----
Bob White
Tom Black
Sue Green
Ann Brown
since only the fname and lname fields have that privacy level. I'm not an expert in writing SQL queries, and I have no control over the existing database design. If this is an obvious thing to do, I apologize for being ignorant of it, but I would truly appreciate some guidance.
You are really hampered here by your data model: the link between privacy level and data would be best modelled in tables by data alone, not by linking data (privacy level) to structure (field names).
Do you have any scope at all to change the data model, even if only adding in new stuff (i.e. not changing/removing existing model)?
What you're trying isn't impossible, but it's much harder work...
Edit - quick example of using dynamic SQL:
http://forums.mysql.com/read.php?60,27979,30437
Edit - example of a data model that would help, although it is a big refactor of what you're perhaps working with, so the dynamic SQL route might be more practical...
field_name field_security_level user_id user_emp_id field_value
---------- -------------------- ------- ----------- -----------
fname 3 1 12345 Bob
lname 3 1 12345 White
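For the dynamic SQL route, the general idea is to read the column names from description first and then build the SELECT from them. The sketch below shows that idea in application code, using Python's built-in sqlite3 module with the question's sample data; the function name select_by_privacy is hypothetical, and in MySQL the column whitelist would more likely come from information_schema and the quoting would use backticks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE description (id INTEGER PRIMARY KEY, colalias TEXT, privacy INTEGER);
CREATE TABLE information (
    id INTEGER PRIMARY KEY, fname TEXT, lname TEXT, salary INTEGER, empid TEXT
);
INSERT INTO description VALUES
    (1, 'fname', 3), (2, 'lname', 3), (3, 'salary', 2), (4, 'empid', 1);
INSERT INTO information VALUES
    (1, 'Bob', 'White', 50000, '12345'), (2, 'Tom', 'Black', 75000, '54321'),
    (3, 'Sue', 'Green', 82000, '67890'), (4, 'Ann', 'Brown', 63000, '09876');
""")

def select_by_privacy(conn, level):
    """Return rows of information restricted to columns with this privacy level."""
    wanted = [r[0] for r in conn.execute(
        "SELECT colalias FROM description WHERE privacy = ? ORDER BY id",
        (level,))]
    # Identifiers cannot be bound as parameters, so only accept names that
    # really are columns of information before splicing them into the SQL.
    real = {r[1] for r in conn.execute("PRAGMA table_info(information)")}
    columns = [c for c in wanted if c in real]
    if not columns:
        return []
    column_list = ", ".join('"%s"' % c for c in columns)
    sql = "SELECT " + column_list + " FROM information ORDER BY id"
    return conn.execute(sql).fetchall()

rows = select_by_privacy(conn, 3)
```

The whitelist step matters because the column names come from table data, which should not be trusted as SQL text.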
UPDATE - NEW ANSWER
After @Brian's comment, I have realized that, per my comment below, a MySQL subquery is likely going to do the trick, allowing you to query based on a previous result.
There is a fairly good overview on them at RoseIndia.net that will give you a good start.
OLD ANSWER
You're most likely looking for a LEFT JOIN to achieve this.
I'm assuming that the id columns are related to each other to start with. Personally I would have 2 separate primary keys and then relate 1 table to the other using a foreign key but moving on.
I would suggest you check out KeithJBrown.co.uk that has a good explanation and some good examples of different types of joins.
Finally, to only get the columns you want, you simply specify them as SELECT information.fname, information.lname in the SQL statement.
I hope this helps you get on the right track, also, next time do a bit of research with Google first, try a few things then when you get stuck ask for help.

Whats the most efficient way to store an array of integers in a MySQL column?

I've got two tables
A:
plant_ID | name.
1 | tree
2 | shrubbery
20 | notashrubbery
B:
area_ID | name | plants
1 | forrest | *needhelphere*
Now I want the area to store any number of plants, in a specific order, and some plants might show up a number of times, e.g. 2,20,1,2,2,20,1
What's the most efficient way to store this array of plants?
Keeping in mind that if I search for areas with plant 2, I don't want to get areas which are e.g. 1,20,232,12,20 (pad with leading 0s?). What would be the query for that?
If it helps, let's assume I have a database of no more than 99999999 different plants. And yes, this question doesn't have anything to do with plants....
Bonus Question
Is it time to step away from MySQL? Is there a better DB to manage this?
If you're going to be searching both by forest and by plant, sounds like you would benefit from a full-on many-to-many relationship. Ditch your plants column, and create a whole new areas_plants table (or whatever you want to call it) to relate the two tables.
If area 1 has plants 1 and 2, and area 2 has plants 2 and 3, your areas_plants table would look like this:
area_id | plant_id | sort_idx
-----------------------------
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
2 | 3 | 1
You can then look up relationships from either side, and use simple JOINs to get the relevant data from either table. No need to muck about in LIKE conditions to figure out if it's in the list, blah, bleh, yuck. I've been there for a legacy database. No fun. Use SQL to its greatest potential.
How about this:
table: plants
plant_ID | name
1 | tree
2 | shrubbery
20 | notashrubbery
table: areas
area_ID | name
1 | forest
table: area_plant_map
area_ID | plant_ID | sequence
1 | 1 | 0
1 | 2 | 1
1 | 20 | 2
That's the standard normalized way to do it (with a mapping table).
To find all areas with a shrubbery (plant 2), do this:
SELECT *
FROM areas
INNER JOIN area_plant_map ON areas.area_ID = area_plant_map.area_ID
WHERE plant_ID = 2
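To show how the mapping table handles the question's requirements (a fixed order, repeated plants, and no false match on 20 or 232 when searching for 2), here is a small sketch using Python's built-in sqlite3 module in place of MySQL. The helper store_plants, the second area 'meadow', and the choice of (area_ID, sequence) as the key are assumptions for illustration; the plants table and its FK are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE areas (area_ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE area_plant_map (
    area_ID  INTEGER NOT NULL REFERENCES areas(area_ID),
    sequence INTEGER NOT NULL,
    plant_ID INTEGER NOT NULL,
    PRIMARY KEY (area_ID, sequence)   -- a plant may repeat, a position may not
);
INSERT INTO areas VALUES (1, 'forest'), (2, 'meadow');
""")

def store_plants(conn, area_id, plant_ids):
    """Store an ordered list of plant ids, one row per list position."""
    conn.executemany(
        "INSERT INTO area_plant_map (area_ID, sequence, plant_ID) "
        "VALUES (?, ?, ?)",
        [(area_id, pos, plant) for pos, plant in enumerate(plant_ids)],
    )

store_plants(conn, 1, [2, 20, 1, 2, 2, 20, 1])
store_plants(conn, 2, [1, 20, 232, 12, 20])

# DISTINCT keeps an area from appearing once per occurrence of the plant.
areas_with_plant_2 = [r[0] for r in conn.execute(
    "SELECT DISTINCT areas.name FROM areas "
    "INNER JOIN area_plant_map ON areas.area_ID = area_plant_map.area_ID "
    "WHERE plant_ID = ?", (2,))]

# The original array comes back by ordering on the position column.
forest_plants = [r[0] for r in conn.execute(
    "SELECT plant_ID FROM area_plant_map WHERE area_ID = ? ORDER BY sequence",
    (1,))]
```

Since the comparison is on an integer column, no padding or string matching is needed.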
You know this violates normal form?
Typically, one would have an areaplants table: area_ID, plant_ID with a unique constraint on the two and foreign keys to the other two tables. This "link" table is what gives you many-many or many-to-one relationships.
Queries on this are generally very efficient, they utilize indexes and do not require parsing strings.
8 years after this question was asked, here are two ideas:
1. Use the JSON type
As of MySQL 5.7.8, MySQL supports a native JSON data type defined by RFC 7159 that enables efficient access to data in JSON (JavaScript Object Notation) documents.
2. Use your own codification
Turn the plants column into a string field (varchar or text, your choice, think about performance); then you can represent values as, for example, -21-30-2-4-20- and filter using LIKE '%-2-%'.
If you somehow try one of these, I'd love it if you shared your performance results, with 100M rows as you suggested.
--
Remember that using either of these breaks first normal form, which says every column should hold a single value
Your relation attributes should be atomic, not made up of multiple values like lists. It is too hard to search them. You need a new relation that maps the plants to the area_ID and the area_ID/plant combination is the primary key.
Use many-to-many relationship:
CREATE TABLE plant (
    plant_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255)
) ENGINE=INNODB;
CREATE TABLE area (
    area_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255)
) ENGINE=INNODB;
CREATE TABLE plant_area_xref (
    plant_id INT NOT NULL,
    area_id INT NOT NULL,
    sort_idx INT NOT NULL,
    FOREIGN KEY (plant_id) REFERENCES plant(plant_id) ON DELETE CASCADE,
    FOREIGN KEY (area_id) REFERENCES area(area_id) ON DELETE CASCADE,
    PRIMARY KEY (plant_id, area_id, sort_idx)
) ENGINE=INNODB;
EDIT:
Just to answer your bonus question:
Bonus Question Is it time to step away from MySQL? Is there a better DB to manage this?
This has nothing to do with MySQL. This was just an issue with bad database design. You should use intersection tables and many-to-many relationship for cases like this in every RDBMS (MySQL, Oracle, MSSQL, PostgreSQL etc).