All popular SQL databases that I am aware of implement foreign keys efficiently by indexing them.
Assuming an N:1 relationship Student -> School, the school id is stored in the student table with a (sometimes optional) index. For a given student you can find their school by looking up the school id in the row, and for a given school you can find its students by looking up the school id in the index over the foreign key in Students. Relational databases 101.
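To make that concrete, here is a minimal sketch in MySQL syntax (table and column names are illustrative, not taken from any particular system):

CREATE TABLE school (
    school_id INT PRIMARY KEY,
    school_name VARCHAR(100) NOT NULL
);

CREATE TABLE student (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100) NOT NULL,
    school_id INT NOT NULL,
    INDEX idx_student_school (school_id),  -- the index on the foreign key (InnoDB creates one automatically if omitted)
    FOREIGN KEY (school_id) REFERENCES school (school_id)
);

-- For a given student, the school id is right there in the row:
SELECT school_id FROM student WHERE student_id = 1;

-- For a given school, its students are found via the index on the foreign key:
SELECT student_id FROM student WHERE school_id = 7;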
But is that the only sensible implementation? Imagine you are the database implementer, and instead of using a btree index on the foreign key column, you add an (invisible to the user) set on the row at the other (many) end of the relation. So instead of indexing the school id column in students, you had an invisible column that was a set of student ids on the school row itself. Then fetching the students for a given school is as simple as iterating the set. Is there a reason this implementation is uncommon? Are there some queries that can't be supported efficiently this way? The two approaches seem more or less equivalent, modulo particular implementation details. It seems to me you could emulate either solution with the other.
In my opinion it's conceptually the same as splitting up the btree, which contains sorted runs of (school_id, student_row_id), and storing each run on the school row itself. Looking up a school id in the school primary key gives you the run of student ids, the same as looking up a school id in the foreign key index would have.
You seem to be suggesting storing a "comma separated list of values" as a string in a character column of a table. And you say that it's "as simple as iterating the set".
But in a relational database, it turns out that "iterating the set" when it's stored as a list of values in a column is not at all simple. Nor is it efficient. Nor does it conform to the relational model.
Consider the operations required when a member needs to be added to a set, or removed from the set, or even just determining whether a member is in a set. Consider the operations that would be required to enforce integrity, to verify that every member in that "comma separated list" is valid. The relational database engine is not going to help us out with that, we'll have to code all of that ourselves.
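To illustrate, assuming a hypothetical school table that stores its students as a string column student_ids (MySQL syntax):

-- Membership test: FIND_IN_SET must parse the string in every row; no index can help.
SELECT school_id FROM school WHERE FIND_IN_SET('42', student_ids);

-- Adding a member: the whole string is rewritten, and nothing verifies that student 42 exists.
UPDATE school SET student_ids = CONCAT(student_ids, ',42') WHERE school_id = 7;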
At first blush, this idea may seem like a good approach. And it's entirely possible to do, and to get some code working. But once we move beyond the trivial demonstration, into the realm of real problems and real world data volumes, it turns out to be a really, really bad idea.
Storing comma-separated lists is an all-too-familiar SQL anti-pattern.
I strongly recommend Chapter 2 of Bill Karwin's excellent book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming ISBN-13: 978-1934356555
(The discussion here relates to "relational database" and how it is designed to operate, following the relational model, the theory developed by Ted Codd and Chris Date.)
"All nonkey columns are dependent on the key, the whole key, and nothing but the key. So help me Codd."
Q: Is there a reason this implementation is uncommon?
Yes, it's uncommon because it flies in the face of relational theory. And it makes what would be a straightforward problem (for the relational model) into a confusing jumble that the relational database can't help us with. If what we're storing is just a string of characters, and the database never needs to do anything with that, other than store the string and retrieve the string, we'd be good. But we can't ask the database to decipher that as representing relationships between entities.
Q: Are there some queries that can't be supported efficiently this way?
Any query that would need to turn that "list of values" into a set of rows to be returned would be inefficient. Any query that would need to identify a "list of values" containing a particular value would be inefficient. And operations to insert or remove a value from the "list of values" would be inefficient.
This might buy you some small benefit in a narrow set of cases. But the drawbacks are numerous.
Such indices are useful for more than just direct joins from the parent record. A query might GROUP BY the FK column, or join it to a temp table / subquery / CTE; all of these cases might benefit from the presence of an index, but none of the queries involve the parent table.
Even direct joins from the parent often involve additional constraints on the child table. Consequently, indices defined on child tables commonly include other fields in addition to the key itself.
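For example (sketches only; the enrollment_year column is invented for illustration):

-- Aggregation over the foreign key column; the parent table is never touched:
SELECT school_id, COUNT(*) FROM student GROUP BY school_id;

-- A composite index serving the join plus an additional constraint on the child:
ALTER TABLE student ADD INDEX idx_school_year (school_id, enrollment_year);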
Even if there appear to be fewer steps involved in this algorithm, that does not necessarily equate to better performance. Databases don't read from disk a column at a time; they typically load data in fixed-size blocks. As a result, storing this information in a contiguous structure may allow it to be accessed far more efficiently than scattering it across multiple tuples.
No database that I'm aware of can inline an arbitrarily large column; either you'd have a hard limit of a few thousand children, or you'd have to push this list to some out-of-line storage (and with this extra level of indirection, you've probably lost any benefit over an index lookup).
Databases are not designed for partial reads or in-place edits of a column value. You would need to fetch the entire list whenever it's accessed, and more importantly, replace the entire list whenever it's modified.
In fact, you'd need to duplicate the entire row whenever the child list changes; the MVCC model handles concurrent modifications by maintaining multiple versions of a record. And not only are you spawning more versions of the record, but each version holds its own copy of the child list.
Probably most damning is the fact that an insert on the child table now triggers an update of the parent. This involves locking the parent record, meaning that concurrent child inserts or deletes are no longer allowed.
I could go on. There might be mitigating factors or obvious solutions in many of these cases (not to mention outright misconceptions on my part), though there are probably just as many issues that I've overlooked. In any case, I'm satisfied that they've thought this through fairly well...
I have a MySQL table like this, and I want to create indexes that make all queries to the table run fast. The difficult thing is that there are many possible combinations of WHERE conditions, and that the size of the table is large (about 6M rows).
Table name: items
id: PKEY
item_id: int (the id of items)
category_1: int
category_2: int
.
.
.
category_10: int
release_date: date
sort_score: decimal
item_id is not unique because an item can have several numbers of category_x.
An example of queries to this table is:
SELECT DISTINCT(item_id) FROM items WHERE category_1 IN (1, 2) AND category_5 IN (3, 4) AND release_date > '2019-01-01' ORDER BY sort_score
And another query maybe:
SELECT DISTINCT(item_id) FROM items WHERE category_3 IN (1, 2) AND category_4 IN (3, 4) AND category_8 IN (5) ORDER BY sort_score
If I want to optimize all the combinations of WHERE conditions, do I have to make a huge number of composite indexes of the column combinations? (like ADD INDEX idx1_3_5(category_1, category_3, category_5))
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Or, is it difficult to optimize this kind of query in MySQL, and should I use other middleware, such as Elasticsearch?
Well, the file (it is not a table) is not at all Normalised. Therefore no amount of indices on combinations of fields will help the queries.
Second, MySQL is (a) not compliant with the SQL requirement, and (b) it does not have a Server Architecture or the features of one.
Such as Statistics, which is used by a genuine Query Optimiser, which commercial SQL platforms have. The "single index" issue you raise in the comments does not apply.
Therefore, while we can fix up the table, etc, you may never obtain the performance that you seek from the freeware.
Eg. in the commercial world, 6M rows is nothing, we worry when we get to a billion rows.
Eg. Statistics is automatic, we have to tweak it only when necessary: an un-normalised table or billions of rows.
Or ... should I use other middleware, such as Elasticsearch?
It depends on the use of genuine SQL vs MySQL, and the middleware.
If you fix up the file and make a set of Relational tables, the queries are then quite simple, and fast. It does not justify a middleware search engine (that builds a data cube on the client system).
If they are not fast on MySQL, then the first recommendation would be to get a commercial SQL platform instead of the freeware.
The last option, the very last, is to stick to the freeware and add a big fat middleware search engine to compensate.
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Yes. JOINs are quite ordinary in SQL. Contrary to popular mythology, a normalised database, which means many more tables than an un-normalised one, causes fewer JOINs, not more JOINs.
So, yes, Normalise that beast. Ten tables is the starting perception; that is still not at all Normalised. One table for each of the following would be a step in the direction of Normalised:
Item
Item_id will be unique.
Category
This is not category_1, etc, but each of the values that are in category_1, etc. You must not have multiple values in a single column; that breaks 1NF. Such values will be (a) Atomic, and (b) unique. The Relational Model demands that the rows are unique.
The meaning of category_1, etc in Item is not given. (If you provide some example data, I can improve the accuracy of the data model.) Obviously it is not [2].
If it is a Priority (1..10), or something similar, that the users have chosen or voted on, this table will be a table that supplies the many-to-many relationship between Item and Category, with a Priority for each row.
Let's call it Poll. The relevant Predicates would be something like:
Each Poll is 1 Item
Each Poll is 1 Priority
Each Poll is 1 Category
Likewise, sort_score is not explained. If it is even remotely what it appears to be, you will not need it, because it is a Derived Value that you should compute on the fly: once the tables are Normalised, the SQL required to compute it is straight-forward. It is not one that you compute-and-store every 5 minutes or every 10 seconds.
The Relational Model
The above maintains the scope of just answering your question, without pointing out the difficulties in your file. Noting the Relational Database tag, this section deals with the Relational errors.
The Record ID field (item_id or category_id in your case) is prohibited in the Relational Model. It is a physical pointer to a record, which is explicitly the very thing that the RM overcomes, and that is required to be overcome if one wishes to obtain the benefits of the RM, such as ease of queries, and simple, straight-forward SQL code.
Conversely, the Record ID is always one additional column and one additional index, and the SQL code required for navigation becomes complex (and buggy) very quickly. You will have enough difficulty with the code as it is, I doubt you would want the added complexity.
Therefore, get rid of the Record ID fields.
The Relational Model requires that the Keys are "made up from the data". That means something from the logical row, that the users use. Usually they know precisely what identifies their data, such as a short name.
It is not manufactured by the system, such as a RecordID field which is a GUID or AUTOINCREMENT, which the user does not see. Such fields are physical pointers to records, not Keys to logical rows. Such fields are pre-Relational, pre-DBMS, 1960s Record Filing Systems, the very thing that the RM superseded. But they are heavily promoted and marketed as "relational".
Relational Data Model • Initial
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
Relational Data Model • Improved
Ternary relations (aka three-way JOINs) are known to be a problem, indicating that further Normalisation is required. Codd teaches that every ternary relation can be reduced to two binary relations.
In your case, perhaps an Item has certain, not all, Categories. The above implements Polls of Items allowing all Categories for each Item, which is a typical error in a ternary relation, and is why it requires further Normalisation. It is also the classic error in every RFS file.
The corrected model would therefore be to establish the Categories for each Item first as ItemCategory, your "item can have several numbers of category_x", and then to allow Polls on that constrained ItemCategory. Note, this level of constraining data is not possible in 1960s Record Filing Systems, in which the "key" is a fabricated id field:
Each ItemCategory is 1 Item
Each ItemCategory is 1 Category
Each Poll is 1 Priority
Each Poll is 1 ItemCategory
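A sketch of the corrected model's tables in SQL, under one plausible reading of the Predicates (names and types are illustrative; the real model would use the Keys established in the Normalisation exercise):

CREATE TABLE Item (
    ItemName VARCHAR(30) NOT NULL PRIMARY KEY
);

CREATE TABLE Category (
    CategoryName VARCHAR(30) NOT NULL PRIMARY KEY
);

CREATE TABLE ItemCategory (
    ItemName VARCHAR(30) NOT NULL,
    CategoryName VARCHAR(30) NOT NULL,
    PRIMARY KEY (ItemName, CategoryName),
    FOREIGN KEY (ItemName) REFERENCES Item (ItemName),
    FOREIGN KEY (CategoryName) REFERENCES Category (CategoryName)
);

CREATE TABLE Poll (
    ItemName VARCHAR(30) NOT NULL,
    CategoryName VARCHAR(30) NOT NULL,
    Priority SMALLINT NOT NULL,
    PRIMARY KEY (ItemName, CategoryName, Priority),
    FOREIGN KEY (ItemName, CategoryName) REFERENCES ItemCategory (ItemName, CategoryName)
);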
Your indices are now simple and straight-forward, no additional indices are required.
Likewise your query code will now be simple and straight-forward, and far less prone to bugs.
Please make sure that you learn about Subqueries. The Poll table supports any type of pivoting that may be required.
It is messy to optimize such queries against such a table. Moving the categories off to other tables would only make it slower.
Here's a partial solution... Identify the categories that are likely to be tested with
=
IN
a range, such as your example release_date > '2019-01-01'
Then devise a few indexes (perhaps no more than a dozen) that have, say, 3-4 columns. Those columns should be ones that are often tested together. Order the columns in the indexes based on the list above. It is quite fine to have multiple = columns (first), but don't include more than one 'range' (last).
Keep in mind that the order of tests in WHERE does not matter, but the order of the columns in an INDEX does.
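For example, indexes matching the two sample queries might look like this (a sketch; whether these exact column combinations are worth indexing depends on which conditions really are tested together):

ALTER TABLE items
    ADD INDEX idx_cat1_cat5_date (category_1, category_5, release_date),  -- IN columns first, the range column last
    ADD INDEX idx_cat3_cat4_cat8 (category_3, category_4, category_8);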
I have a MySQL database with 220 tables. The database is well structured but without any clear relations. I want to find a way to connect the primary key of each table to its corresponding foreign key.
I was thinking to write a script to discover the possible relation between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and get closer to the solution? Also, is there any available tool which does that?
Please advise!
Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, that it out-lives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
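For illustration, on a platform that exposes an information schema, extracting the declared FOREIGN KEY references looks something like this (the database name is a placeholder):

SELECT TABLE_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = 'your_db'
  AND REFERENCED_TABLE_NAME IS NOT NULL;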
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity.
Pretend SQL Platform
Evidently non-compliant databases such as MySQL, PostgreSQL and Oracle fraudulently position themselves as "SQL", but they do not have basic SQL functionality, such as a catalogue. I suppose you get what you pay for.
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
Solution 2
Now you could do all that in SQL, but then the code would be horrendous, and SQL is not designed for that (table comparisons). Which is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straight-forward. That is squarely within awk's design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see a record number 123456 being repeated in the Invoice file, now what other file does this relate to? Every other possible file, Supplier, Customer, Part, Address, CreditCard, where it occurs once only, has a record number 123456!
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table, now what other table does this relate to? The only table that has IBM occurring once is the Customer table.
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs
In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId)?
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK]
In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId) ?
Yes, you can, it happens fairly often in Relational Databases.
(You have not set that tag, but since (a) you are using SQL Server, and (b) you have compound Keys, which indicates a movement in the direction of a Relational Database, I am making that assumption.)
Whether you should or not, in any particular instance, is a separate matter. And that gets into design; modelling; Normalisation.
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK] ?
Ok, so you realise that your design is not exactly robust.
That is a Normalisation question. It cannot be answered on just that pair of tables, because:
Normalisation is an overall issue, all the tables need to be taken into account, together, in the one exercise.
That exercise determines Keys. As the PKs change, the FKs in the child tables will change.
The structure you have detailed is a Record Filing System, not a set of Relational tables. It is full of duplication, and confusion (Facts [1] are not clearly defined).
You appear to be making the classic mistake of stamping an ID field on every file. That (a) cripples the modelling exercise (hence the difficulties you are experiencing) and (b) guarantees a RFS instead of a RDb.
Solution
First, let me say that the level of detail in an answer is constrained to the level of detail given in the question. In this case, since you have provided great detail, I am able to make reasonable decisions about your data.
If I may, it is easier to correct the entire lot of them than to discuss and correct one or the other pair of files.
Various files need to be Normalised ("merged" or separated)
Various duplicate fields need to be Normalised (located with the relevant Facts, such that duplication is eliminated)
Various Facts [1] need to be clarified and established properly.
Please consider this:
Reservation TRD
That is an IDEF1X model, rendered at the Table-Relation level. IDEF1X is the Standard for modelling Relational Databases. Please be advised that every little tick, notch, and mark; the crow's feet; the solid vs dashed lines; the square vs round corners; means something very specific and important. Refer to the IDEF1X Notation. If you do not understand the Notation, you will not be able to understand or work the model.
The Predicates are very important, I have given them for you.
If you would like information on the important Relational concept of Predicates, and how they are used to both understand and verify the model, as well as to describe it in business terms, visit this Answer, scroll down (way down) until you find the Predicate section, and read that carefully.
Assumptions
I have made the following assumptions:
Given that it is 2015, when reserving a Room, the hotel requires Credit Card details. It forms the basis for a Reservation.
Rooms exist independently. RoomId is silly, given that all Rooms are already uniquely Identified by a RoomNo. The PK is ( RoomNo ).
Clients exist independently.
The real Identifier has to be (NameLast, NameFirst, Initial ... ), plus possibly StateCode. Otherwise you will have duplicate rows which are not permitted in a Relational Database.
However, that Key is too wide to be migrated into the child tables [2], so we add [3] a surrogate ( ClientId ), make that the PK, and demote the real Identifier to an AK.
CreditCards belong to Clients, and you want them Identified just once (not on each transaction). The PK is ( ClientId, CreditCardNo ).
Reservations are for Rooms, they do not exist in isolation, independently. Therefore Reservation is a child of Room, and the PK is ( RoomNo, Date ). You can use DateTime if the rooms are not for full days, if they are for short meetings, liaisons, etc.
A Reservation may, or may not, progress to be filled. The PK is identical to the parent. This allows just one filled reservation per Reservation.
Payments do not exist in isolation either. The Payments are only for Reservations.
The Payment may be for a ReservationFee (for "no shows"), or for a filled Reservation, plus extras. I will leave it to you to work out duration changes; etc. Multiple Payments (against a Reservation) are supported.
The PK is the Identifier of the parent, Reservation, plus a sequence number: ( RoomNo, Date, SequenceNo ).
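A sketch of the central tables implied by those Keys (column types are illustrative; Client, CreditCard and the filled-reservation table are omitted for brevity):

CREATE TABLE Room (
    RoomNo SMALLINT NOT NULL PRIMARY KEY
);

CREATE TABLE Reservation (
    RoomNo SMALLINT NOT NULL,
    Date DATE NOT NULL,
    ClientId INT NOT NULL,  -- the surrogate discussed above
    PRIMARY KEY (RoomNo, Date),
    FOREIGN KEY (RoomNo) REFERENCES Room (RoomNo)
);

CREATE TABLE Payment (
    RoomNo SMALLINT NOT NULL,
    Date DATE NOT NULL,
    SequenceNo SMALLINT NOT NULL,
    Amount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (RoomNo, Date, SequenceNo),
    FOREIGN KEY (RoomNo, Date) REFERENCES Reservation (RoomNo, Date)
);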
Relational Database
You now have a Relational Database, with levels of (a) Integrity (b) Power and (c) Speed, each of which is way, way, beyond the capabilities of a Record Filing System. Notice, there is just one ID column.
Note
1. A Database is a collection of Facts about the real world, limited to the scope that the app engages.
2. Which is the single reason that justifies the use of a surrogate.
3. A surrogate is always an addition, not a substitution. The real Keys that make the row unique cannot be abandoned.
Please feel free to ask questions or comment.
I'm trying to figure out what would be the optimal database and table structure to store relationships between nodes of the type (var)char. I've last used MySQL many years ago as a backend for some simple PHP webpages and never got beyond that. I hope some seasoned users can give me their opinion.
Let's say I have a bunch of names:
Thomas
Jane
Felix
Marc
Anne
I now want to store their relationships. My idea is to have two tables that might look like this:
names (id, name):
0 Thomas
1 Jane
2 Felix
3 Marc
4 Anne
...

relationships (id_1, id_2):
0 1
0 3
1 2
3 4
...
The scope of the data is as follows:
Table 'names' will contain approx. 5 million rows.
Table 'relationships' will contain 150-200 million rows.
The database will only be accessed by me, locally (server and client are the same machine)
I don't need responsiveness as with a web server, only a high throughput during the few occasions when I access it (to reduce waiting time)
My questions are:
I recall proper use of PRIMARY_KEY being important. I vaguely remember there being the possibility to assign the key to two columns (i.e. id_1, id_2 in my case); this helps querying I imagine?
Is there a way from within MySQL to prevent the creation of duplicate relationships (e.g. 0:4 & 4:0) during insertion?
MySQL defaults to InnoDB for me. Is this the database you would recommend for my scenario?
Any pointers welcome. Thank you.
Firstly, you need to consider whether your relationships have a "direction" associated with them. For example, the relationship "is a child of" has the opposite direction to the otherwise identical relationship "is a parent of"; on the other hand, the relationship "is a sibling of" is undirected (or bidirectional, depending on one's point of view).
The structure you describe is perfect for directed relationships.
Bidirectional relationships, on the other hand, are often best represented by deliberately performing the duplication described in your second bulletpoint; whilst this consumes more storage, it greatly simplifies queries such as "find all siblings of X"—which might otherwise have to take the union of two separate queries:
SELECT id_2 FROM my_table WHERE id_1=X
UNION
SELECT id_1 FROM my_table WHERE id_2=X
Because there is no index on the resulting column, these sorts of queries can be quite slow if one wants to do something more with the result (such as sort by id, or join with the names table—albeit in that particular case one could perform the joins before the union, but that just increases redundancy and complexity in one's data manipulation code).
One can use triggers to ensure that whenever a relationship is written (inserted, updated or deleted) to a table that represents bidirectional relationships, the same operation is automatically performed on the reverse relationship.
Secondly, the representation you describe is known as an "adjacency list", which is very simple and easy to understand. But it's not great at dealing with deep searches through the data hierarchy, especially on MySQL (which, unlike some other RDBMS, doesn't support recursive functions). Thus finding "all descendants of X" or "all ancestors of Y" is actually quite difficult. Other data models, such as "nested sets" or "transitive closure" are much better for these tasks.
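To see the difficulty, note that with an adjacency list each additional level of depth requires another self-join (X is a placeholder id, as above):

-- Children of X:
SELECT id_2 FROM relationships WHERE id_1 = X;

-- Grandchildren of X need one more join; great-grandchildren would need yet another:
SELECT r2.id_2
FROM relationships r1
JOIN relationships r2 ON r2.id_1 = r1.id_2
WHERE r1.id_1 = X;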
With that preamble said, on to your questions:
I recall proper use of PRIMARY_KEY being important. I vaguely remember there being the possibility to assign the key to two columns (i.e. id_1, id_2 in my case); this helps querying I imagine?
There are four possible primary keys for your relationship table:
(id_1)
(id_2)
(id_1, id_2)
(id_2, id_1)
By definition, a primary key must be unique within your table. Indeed it is the primary means of identifying a record. But if desired one can also define further UNIQUE keys, which have the same constraining effect as a primary key (the differences are relatively minor and beyond the scope of this answer): thus one can actually enforce any combination of the above constraints.
The above constraints would respectively: limit each name to being on one side of the relationship no more than once; limit each name to being on the other side of the relationship no more than once; and the final two limit each combination of names to being within the same relationship no more than once (the difference is merely the order in which the index is stored). If the table represents undirected relationships, then obviously the second and fourth constraints are semantically equivalent to the first and third constraints respectively.
Some examples:
if your table represents "id_1 is the genetic father of id_2" then id_1 might have many children. So (id_1) cannot be the primary key, as it won't uniquely identify records of fathers who have more than one child. On the other hand id_2 can only have a single genetic father (embryological advances aside), so (id_2) will uniquely identify a record and can be the primary key (that said, many-to-one relationships of this sort might as well be modelled via a father_id column in the names table). The other two (composite) keys would permit children to have many fathers and must therefore be incorrect.
if your table represents "id_1 is a parent of id_2" then both a parent can have many children and children can have more than one parent (this is known as a many-to-many relationship). Therefore the first two constraints are incorrect and one must choose between the latter two (as mentioned previously, the difference is merely the order in which the index is stored—so MySQL must locate the first column before it can lookup the second). Incidentally, in this case one might consider adding an additional column to the relationship table that indicates which parent the relationship represents; if a child can only have one parent of each type, then one could define the primary key as (child_id, parent_type).
if your table represents "id_1 and id_2 are married" then both (id_1) and (id_2) are "candidate keys", because noone can be married to more than one other person (at least in the UK, polygamy aside). Thus one might define (id_1) as the primary key and define a second UNIQUE key over (id_2). As mentioned before, one may well wish to place the records inside the table both ways around—and these constraints will not prevent that.
Is there a way from within MySQL to prevent the creation of duplicate relationships (e.g. 0:4 & 4:0) during insertion?
Yes, one can do so with triggers: but note what was said above regarding bidirectional relationships (where such "duplicates" are often desired). An example of a trigger that would enforce this type of constraint might be:
-- Change the client's statement delimiter so the ';' inside the body does not end the CREATE TRIGGER:
DELIMITER ;;
CREATE TRIGGER rel_ins BEFORE INSERT ON relationships FOR EACH ROW
IF EXISTS (
    SELECT * FROM relationships WHERE id_1 = NEW.id_2 AND id_2 = NEW.id_1
) THEN
    SIGNAL SQLSTATE '45000'
        SET MESSAGE_TEXT = 'Reverse relationship already exists';
END IF;;
DELIMITER ;
One may also want a similar trigger "before update".
A situation where a constraint of this sort might be desirable would be where the table represents "is a parent of", since a parent cannot be their child's child (however, in this case it may be worth noting that in such a relationship table, one may actually wish to go further and prevent all circularities—e.g. prevent a child from being the parent of their grandparent). Again, "adjacency list" is not the best model for enforcing this sort of constraint—"nested sets", on the other hand, entirely prevent all circularities purely by virtue of their structure.
MySQL defaults to InnoDB for me. Is this the database you would recommend for my scenario?
The biggest advantage of InnoDB is that it is fully ACID compliant, and thus offers transactional support. This is especially useful if you might write to the database from multiple places at one time. If you're simply going to perform a one-time-load of a bunch of static data into the database for subsequent querying, it may well be a little slower than MyISAM.
I am going to build a PHP web application and have already decided to use Codeigniter + DataMapper (OverZealous edition).
I have discovered that DataMapper (OverZealous edition) requires the use of an extra association table even when there is actually just a one-to-many relationship.
For example, a country can have many players but a player can only belong to one country. Typical database design would be like this:
[countries] country_id(pk), country_name
[players] player_id(pk), player_name, country_id(fk)
However, in DataMapper, it requires the design to be like this:
[countries] country_id(pk), country_name
[players] player_id(pk), player_name
[asso_countries_players] countries_players_id(pk), country_id(fk), player_id(fk)
It's good for maintenance because if later we change our mind that a player can belong to more than one country, it can be done with very little effort.
But what I would like to know is, for such database design, in general, is there any performance gain or loss when compared to the typical design?
"The fastest way to do anything is not to do it at all." -- Cary Millsap, Optimizing Oracle Performance.
"A designer knows he has achieved true elegance not when there is nothing left to add, but when there is nothing left to take away." -- Antoine de Saint-Exupéry
The simpler implementation has two tables and three indexes, two unique.
The more complicated implementation has three tables and five indexes, four unique. The index on asso_countries_players.player_name (which should be a surrogate ID -- what happens if a player's name changes, like if they get married or legally change it, as Chad Ochocinco (nee Johnson) did?) must also be unique to enforce the 0..1 nature of the relationship between players and countries.
If the associative entity isn't required by the data model, then eliminate it. It's generally pretty trivial to transform a 0..1 relationship or 1..n relationship to an n..n relationship later (see the sketch after this list):
Add associative entity (and I'd question the need for a surrogate key there unless the relationship itself had attributes, like a start or end date)
Populate associative entity with existing data
Reimplement the foreign key constraints
Remove superseded foreign key column in child table.
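A sketch of those steps in MySQL syntax, assuming the schemas from the question (the constraint name players_country_fk is hypothetical; use whatever name your schema actually carries):

-- 1. Add the associative entity, keyed so that the 0..1 nature is preserved for now
--    (widen the primary key to (country_id, player_id) if it ever becomes n..n):
CREATE TABLE asso_countries_players (
    country_id INT NOT NULL,
    player_id INT NOT NULL,
    PRIMARY KEY (player_id),
    FOREIGN KEY (country_id) REFERENCES countries (country_id),
    FOREIGN KEY (player_id) REFERENCES players (player_id)
);

-- 2. Populate it with the existing data:
INSERT INTO asso_countries_players (country_id, player_id)
SELECT country_id, player_id FROM players;

-- 3 & 4. Drop the superseded foreign key constraint and column:
ALTER TABLE players DROP FOREIGN KEY players_country_fk;  -- hypothetical constraint name
ALTER TABLE players DROP COLUMN country_id;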
Selecting data and searching will mean more joins: you'll have to work on 3 tables instead of 2.
Inserting data will mean more insert queries: you'll have to insert into 3 tables instead of 2.
So I'm guessing this could mean a bit more work -- which, in turn, might hurt performance a bit.
Because this is one-to-many, I'd personally not use an association table; it's totally unnecessary.
The performance hit from this decision won't be too great. But think about the context of your data too, don't just do it because some program tells you - understand your data.