Undefined database schema - MySQL

I have a MySQL database with 220 tables. The database is well structured, but without any clear relations. I want to find a way to connect the primary key of each table to its corresponding foreign key.
I was thinking of writing a script to discover the possible relations between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and get closer to the solution? Also, is there any available tool which does that?
Please advise!
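The two heuristics listed above can be sketched in a few lines. A toy illustration in Python (the column names and sample values are invented; in practice the values would be pulled from the MySQL server):

```python
from difflib import SequenceMatcher

def value_overlap(fk_values, pk_values):
    """Content-range heuristic: fraction of the candidate FK's distinct
    values that also occur in the candidate PK column."""
    fk, pk = set(fk_values), set(pk_values)
    return len(fk & pk) / len(fk) if fk else 0.0

def name_similarity(fk_column, pk_table):
    """Naming heuristic: how closely the FK column name (minus a trailing
    _id) resembles the referenced table's name, e.g. customer_id vs customers."""
    stem = fk_column.lower().removesuffix("_id")  # requires Python 3.9+
    return SequenceMatcher(None, stem, pk_table.lower()).ratio()

# invented sample data standing in for two MySQL columns
overlap = value_overlap([1, 2, 2, 3, 5], [1, 2, 3, 4, 5])
name_score = name_similarity("customer_id", "customers")
print(overlap, name_score)  # a high overlap plus a similar name suggests a FK
```

Scoring every (column, primary key) pair this way and ranking the results is one way to make the guesswork systematic, even if the final decision is still manual.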

Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, and that it outlives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity.
Pretend SQL Platform
Evidently non-compliant databases such as MySQL, PostgreSQL and Oracle fraudulently position themselves as "SQL", but they do not have basic SQL functionality, such as a catalogue. I suppose you get what you pay for.
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
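As a toy illustration only, with invented data (each exported file assumed already loaded as a column-to-values mapping), those two determinations can be sketched in Python:

```python
# Toy data standing in for two exported files, loaded as
# column-name -> list-of-values mappings; names and values invented.
tables = {
    "customer": {
        "customer_code": ["IBM", "MSFT", "ACME"],
        "name": ["IBM Corp", "Microsoft", "Acme Ltd"],
    },
    "invoice": {
        "invoice_no": [1, 2, 3, 4],
        "customer_code": ["IBM", "IBM", "ACME", "MSFT"],
    },
}

def candidate_keys(table):
    """Step 1: a column whose values are all distinct is a candidate Key."""
    return [col for col, vals in table.items() if len(set(vals)) == len(vals)]

def candidate_references(tables):
    """Step 2: a column whose values are a subset of another table's
    candidate Key is a candidate reference between the two tables."""
    keys = {(t, c): set(tbl[c]) for t, tbl in tables.items() for c in candidate_keys(tbl)}
    refs = []
    for t, tbl in tables.items():
        for col, vals in tbl.items():
            for (kt, kc), kvals in keys.items():
                if kt != t and set(vals) <= kvals:
                    refs.append((t, col, kt, kc))
    return refs

print(candidate_references(tables))
# [('invoice', 'customer_code', 'customer', 'customer_code')]
```

With natural Keys like a customer code, the subset test is meaningful; with record IDs that all range over 1..N, every column is a "subset" of every other, which is exactly the problem described under Non-keys below.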
Solution 2
Now, you could do all that in SQL, but the code would be horrendous, and SQL is not designed for that (table comparisons). Which is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straight-forward. That is squarely within awk's design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see a record number 123456 being repeated in the Invoice file, now what other file does this relate to? Every other possible file, Supplier, Customer, Part, Address, CreditCard, where it occurs once only, has a record number 123456!
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table, now what other table does this relate to? The only table that has IBM occurring once is the Customer table.
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs

Related

Multiple foreign key columns vs multiple join tables

This is yet another database normalization question/discussion, but I'm hoping to get some additional perspective on the trade-offs, advantages, and disadvantages of different scenarios for multiple foreign key columns vs multiple join/intersection tables, as I can't seem to find any practical information or advice on how MySQL would optimize or fail on the different approaches.
I'm asking for general guidance on how others approach objects with multiple 1:N relationships and foreign keys where a majority of them will always be null.
As a basic example, let's say I have a Project Management app with an uploads table for storing uploaded file information. For "scale", there are 20 million current uploads, with 1,000 added daily.
Uploads can have a direct relation to a couple of different objects as its "parent" or "owner": directly to a Project, directly to a Todo, and directly to a Comment. Each upload would only ever have a single relationship at a time, never multiple.
Potential options for structuring that I see:
Option 1: Single table multiple foreign key columns
uploads
upload_id, filepath, project_id, todo_id, comment_id
foreign keys for project_id, todo_id, comment_id.
Potential Problem: Large number of null values in the foreign keys. Potentially slow writes/locks at high volumes due to the FK constraints and the additional index sizes.
Option 2: Multiple Intersection/Join tables
uploads
upload_id, filepath
project_uploads
project_id, upload_id
todo_uploads
todo_id, upload_id
comment_uploads
comment_id, upload_id
foreign keys on all columns for *_uploads tables
Potential Problem: People will confuse it for an N:N instead of a 1:N relationship. "Relative", but more difficult selects to produce in the application layer, especially when selecting uploads for projects, as you would need to join all the tables to get the entire list of project ids for the uploads, since todos and comments would each also belong to a parent.
Option 3: Single Relation/Join table with a type
uploads
upload_id, filepath
objects_uploads
upload_id, object_id, type
foreign key on upload_id, standard indexes on object_id and type.
Potential Problem: more confusing schema, not truly "relational" or normalized
I'd also like to throw out the potential of using JSON fields on the individual objects and just always enforcing project_id on the uploads. I have very limited experience with JSON field types or their pitfalls. I'm assuming selections to get uploads specifically parented/uploaded to a todo or comment would be far more difficult, as you would need to select the ids out of the JSON.
Are there any other approaches or considerations I'm overlooking? Are there specific considerations to follow based on different workloads: higher write volumes, high read, etc.? Thanks for any information, insights or resources.
Edit
To clarify, I understand that the above outlines can represent differences in the schemas/relationships of the objects. I'm really only thinking about write and select performance, and the considerations or trade-offs to make around indexes/constraints and joins. Specifically for this question, I'm less concerned about referential integrity or 100% data integrity.
I've modified some language in my original example above. I'm looking for ideal configurations for objects that can be related to many different objects, but never at the same time, leaving most foreign key columns null. Here is a similar question from 3.5 years ago...
https://softwareengineering.stackexchange.com/questions/335284/disadvantages-of-using-a-nullable-foreign-key-instead-of-creating-an-intersectio
Basically trying to find some general advice when to consider or optimize in the different ways, gauge any real impact of large amount of nulls in Foreign keys and potential reasons for when to prefer different approaches.
Option 1 models three many-to-one relationships. That is, a given upload can have at most one reference to project, at most one reference to todo, and at most one reference to comment. This would be a simple way to model these as many-to-one relationships. Don't shy away from using NULLs, they don't take storage space.
Option 2 models three many-to-many relationships. A given upload may be associated with multiple projects, multiple todos, and multiple comments. I think this is what Akina was commenting about above. If your application needs these to be many-to-many relationships, then you need these three intersection tables to model that data. If you don't need these to be many-to-many relationships, then don't create these tables.
Option 3 is not a relational data model at all. It conflicts with several normal forms.
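If Option 1 is what the question needs, the at-most-one-parent rule can itself be enforced declaratively with a CHECK constraint. A minimal sketch, using an in-memory SQLite database so the example is self-contained (MySQL 8.0.16+ enforces CHECK constraints similarly; table and column names follow the question, and the parent tables are omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uploads (
    upload_id  INTEGER PRIMARY KEY,
    filepath   TEXT NOT NULL,
    project_id INTEGER REFERENCES projects(project_id),
    todo_id    INTEGER REFERENCES todos(todo_id),
    comment_id INTEGER REFERENCES comments(comment_id),
    -- exactly one parent: count the non-NULL foreign keys
    CHECK ((project_id IS NOT NULL) + (todo_id IS NOT NULL) + (comment_id IS NOT NULL) = 1)
);
""")
conn.execute("INSERT INTO uploads VALUES (1, '/f/a.png', 10, NULL, NULL)")  # one parent: accepted
try:
    conn.execute("INSERT INTO uploads VALUES (2, '/f/b.png', 10, 20, NULL)")  # two parents
    violated = False
except sqlite3.IntegrityError as e:
    violated = True
    print("rejected:", e)
```

With that constraint in place, the many NULLs are harmless bookkeeping rather than a data-quality risk.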

Polymorphic relationships vs separate tables per type

I am working on a database which has some types (e.g. User, Appointment, Task etc.) which can have zero or more Notes associated with each type.
The possible solutions I have come across for implementing these relationships are:
Polymorphic relationship
Separate table per type
Polymorphic Relationship
Suggested by many as being the easiest solution to implement, and seemingly the most common implementation for frameworks that follow the Active Record pattern, I would add a notes table whose data is morphable, with notable_type and notable_id columns:
My notable_type would allow me to distinguish between the type (User, Appointment, Task) the Note relates to, whilst the notable_id would allow me to obtain the individual type record in the related type table.
PROS:
Easy to scale, more models can be easily associated with the polymorphic class
Limits table bloat
Results in one class that can be used by many other classes (DRY)
CONS:
More types can make querying more difficult and expensive as the data grows
Cannot have a foreign key
Lack of data consistency
Separate Table per Type
Alternatively I could create a table for each type which is responsible for the Notes associated with that type only. The type_id foreign key would allow me to quickly obtain the individual type record.
Deemed by many online as a code smell, many articles advocate avoiding the polymorphic relationship in favour of an alternative (here and here for example).
PROS:
Allows us to use foreign keys effectively
Efficient data querying
Maintains data consistency
CONS:
Increases table bloat as each type requires a separate table
Results in multiple classes, each representing the separate type_notes table
Thoughts
The polymorphic relationship is certainly the simpler of the two options to implement, but the lack of foreign key constraints and therefore potential for consistency issues feels wrong.
A table per notes relationship (user_notes, task_notes etc.) with foreign keys seems the correct way (in keeping with design patterns) but could result in a lot of tables (addition of other types that can have notes or addition of types similar to notes [e.g. events]).
It feels like my choice is either simplified table structure but forgo foreign keys and increased query overhead, or increase the number of tables with the same structure but simplify queries and allow for foreign keys.
Given my scenario which of the above would be more appropriate, or is there an alternative I should consider?
What is "table bloat"? Are you concerned about having too many tables? Many real-world databases I've worked on have between 100 and 200 tables, because that's what it takes.
If you're concerned with adding multiple tables, then why do you have separate tables for User, Appointment, and Task? If you had a multi-valued attribute for User, for example for multiple phone numbers per user, would you create a separate table for phones, or would you try to combine them all into the user table somehow? Or have a polymorphic "things that belong to other things" table for user phones, appointment invitees, and task milestones?
Answer: No, you'd create a Phone table, and use it to reference only the User table. If Appointments have invitees, that gets its own table (probably a many-to-many between appointments and users). If tasks have milestones, that gets its own table too.
The correct thing to do is to model your database tables like you would model object types in your application. You might like to read a book like SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date to learn more about how tables are analogous to types.
You already know instinctively that the fact that you can't create a foreign key is a red flag. A foreign key must reference exactly one parent table. This should be a clue that it's not valid relational database design to make a polymorphic foreign key. Once you start thinking of tables and their attributes as concrete types (like described in SQL and Relational Theory), this will become obvious.
If you must create one notes table, you could make it reference one table called "Notable" which is like a superclass of User, Appointment, and Task. Then each of those three tables would also reference a primary key of Notable. This mimics the object-oriented structure of polymorphism, where you can have a class Note have a reference to an object by its superclass type.
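A minimal sketch of that "Notable" supertype shape, using in-memory SQLite and assumed column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared-supertype sketch: every notable thing first gets a row in
-- notable; the subtype tables and the notes both reference that one key.
CREATE TABLE notable (
    notable_id INTEGER PRIMARY KEY
);
CREATE TABLE user (
    notable_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    name       TEXT NOT NULL
);
CREATE TABLE appointment (
    notable_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    starts_at  TEXT NOT NULL
);
CREATE TABLE note (
    note_id    INTEGER PRIMARY KEY,
    notable_id INTEGER NOT NULL REFERENCES notable(notable_id),
    body       TEXT NOT NULL
);
INSERT INTO notable (notable_id) VALUES (1);
INSERT INTO user VALUES (1, 'Alice');
INSERT INTO note (notable_id, body) VALUES (1, 'first note');
""")
rows = conn.execute(
    "SELECT u.name, n.body FROM note n JOIN user u USING (notable_id)"
).fetchall()
print(rows)  # [('Alice', 'first note')]
```

Every note now has a real foreign key, at the cost of an extra table and an extra insert per row, which is the complexity the next paragraph objects to.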
But IMHO, that's more complex than it needs to be. I would just create separate tables for UserNotes, AppointmentNotes, and TaskNotes. I'm not troubled by having three more tables, and it makes your code more clear and maintainable.
I think you should think about these two things, before you can make a decision.
Performance: a lot of reads, a lot of writes? Test which is better.
Growth of your model: can it easily be expanded?

Implementing efficient foreign keys in a relational database

All popular SQL databases, that I am aware of, implement foreign keys efficiently by indexing them.
Assuming an N:1 relationship Student -> School, the school id is stored in the student table with a (sometimes optional) index. For a given student you can find their school by just looking up the school id in the row, and for a given school you can find its students by looking up the school id in the index over the foreign key in Students. Relational databases 101.
But is that the only sensible implementation? Imagine you are the database implementer, and instead of using a btree index on the foreign key column, you add an (invisible to the user) set on the row at the other (one) end of the relation. So instead of indexing the school id column in students, you would have an invisible column that was a set of student ids on the school row itself. Then fetching the students for a given school is as simple as iterating the set. Is there a reason this implementation is uncommon? Are there some queries that can't be supported efficiently this way? The two approaches seem more or less equivalent, modulo particular implementation details. It seems to me you could emulate either solution with the other.
In my opinion it's conceptually the same as splitting up the btree, which contains sorted runs of (school_id, student_row_id), and storing each run on the school row itself. Looking up a school id in the school primary key gives you the run of student ids, the same as looking up a school id in the foreign key index would have.
edited for clarity
You seem to be suggesting storing "comma separated list of values" as a string in a character column of a table. And you say that it's "as simple as iterating the set".
But in a relational database, it turns out that "iterating the set" when it's stored as a list of values in a column is not at all simple. Nor is it efficient. Nor does it conform to the relational model.
Consider the operations required when a member needs to be added to a set, or removed from the set, or even just determining whether a member is in a set. Consider the operations that would be required to enforce integrity, to verify that every member in that "comma separated list" is valid. The relational database engine is not going to help us out with that, we'll have to code all of that ourselves.
At first blush, this idea may seem like a good approach. And it's entirely possible to do, and to get some code working. But once we move beyond the trivial demonstration, into the realm of real problems and real world data volumes, it turns out to be a really, really bad idea.
Storing comma-separated lists is an all-too-familiar SQL anti-pattern.
I strongly recommend Chapter 2 of Bill Karwin's excellent book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming ISBN-13: 978-1934356555
(The discussion here relates to "relational database" and how it is designed to operate, following the relational model, the theory developed by Ted Codd and Chris Date.)
"All nonkey columns are dependent on the key, the whole key, and nothing but the key. So help me Codd."
Q: Is there a reason this implementation is uncommon?
Yes, it's uncommon because it flies in the face of relational theory. And it makes what would be a straightforward problem (for the relational model) into a confusing jumble that the relational database can't help us with. If what we're storing is just a string of characters, and the database never needs to do anything with that, other than store the string and retrieve the string, we'd be good. But we can't ask the database to decipher that as representing relationships between entities.
Q: Are there some queries that can't be supported efficiently this way?
Any query that would need to turn that "list of values" into a set of rows to be returned would be inefficient. Any query that would need to identify a "list of values" containing a particular value would be inefficient. And operations to insert or remove a value from the "list of values" would be inefficient.
This might buy you some small benefit in a narrow set of cases. But the drawbacks are numerous.
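The query-efficiency point can be seen in miniature with the Student/School example from the question, using an in-memory SQLite database (schema invented for illustration): with child rows, membership is a plain indexed predicate; with a comma-separated list, it becomes string surgery that no index can help with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, school_id INTEGER);
CREATE INDEX idx_student_school ON student(school_id);
CREATE TABLE school_csv (id INTEGER PRIMARY KEY, student_ids TEXT);
INSERT INTO student VALUES (1, 10), (2, 10), (3, 11);
INSERT INTO school_csv VALUES (10, '1,2'), (11, '3');
""")

# Relational way: a simple indexed equality predicate.
rows = conn.execute("SELECT id FROM student WHERE school_id = 10").fetchall()

# CSV way: pattern-matching against a string, which defeats indexing and
# must pad delimiters so that '1' does not also match '11'.
schools = conn.execute(
    "SELECT id FROM school_csv WHERE ',' || student_ids || ',' LIKE '%,2,%'"
).fetchall()
print(rows, schools)  # [(1,), (2,)] [(10,)]
```

Adding or removing a member in the CSV version means rewriting the whole string, which foreshadows the MVCC and locking objections in the next answer.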
Such indices are useful for more than just direct joins from the parent record. A query might GROUP BY the FK column, or join it to a temp table / subquery / CTE; all of these cases might benefit from the presence of an index, but none of the queries involve the parent table.
Even direct joins from the parent often involve additional constraints on the child table. Consequently, indices defined on child tables commonly include other fields in addition to the key itself.
Even if there appear to be fewer steps involved in this algorithm, that does not necessarily equate to better performance. Databases don't read from disk a column at a time; they typically load data in fixed-size blocks. As a result, storing this information in a contiguous structure may allow it to be accessed far more efficiently than scattering it across multiple tuples.
No database that I'm aware of can inline an arbitrarily large column; either you'd have a hard limit of a few thousand children, or you'd have to push this list to some out-of-line storage (and with this extra level of indirection, you've probably lost any benefit over an index lookup).
Databases are not designed for partial reads or in-place edits of a column value. You would need to fetch the entire list whenever it's accessed, and more importantly, replace the entire list whenever it's modified.
In fact, you'd need to duplicate the entire row whenever the child list changes; the MVCC model handles concurrent modifications by maintaining multiple versions of a record. And not only are you spawning more versions of the record, but each version holds its own copy of the child list.
Probably most damning is the fact that an insert on the child table now triggers an update of the parent. This involves locking the parent record, meaning that concurrent child inserts or deletes are no longer allowed.
I could go on. There might be mitigating factors or obvious solutions in many of these cases (not to mention outright misconceptions on my part), though there are probably just as many issues that I've overlooked. In any case, I'm satisfied that they've thought this through fairly well...

SQL Server 2008: can 2 tables have the same composite primary key?

In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId)?
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK]
In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId) ?
Yes, you can, it happens fairly often in Relational Databases.
(You have not set that tag, but since (a) you are using SQL Server, and (b) you have compound Keys, which indicates a movement in the direction of a Relational Database, I am making that assumption.)
Whether you should or not, in any particular instance, is a separate matter. And that gets into design; modelling; Normalisation.
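A minimal sketch of the "yes, you can" part, using in-memory SQLite for a self-contained example (SQL Server DDL is analogous; non-key column details are assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two tables may legitimately share the same compound PRIMARY KEY:
-- each states a different Fact about the same (clientId, roomId) pair.
CREATE TABLE reserve_details (
    clientId    INTEGER NOT NULL,
    roomId      INTEGER NOT NULL,
    reserveDate TEXT NOT NULL,
    PRIMARY KEY (clientId, roomId)
);
CREATE TABLE payment_details (
    clientId INTEGER NOT NULL,
    roomId   INTEGER NOT NULL,
    amount   REAL NOT NULL,
    PRIMARY KEY (clientId, roomId)
);
""")
# the same (clientId, roomId) pair keys one row in each table
conn.execute("INSERT INTO reserve_details VALUES (1, 101, '2015-06-01')")
conn.execute("INSERT INTO payment_details VALUES (1, 101, 99.0)")
```

Whether this pair of tables is the right model is the Normalisation question the rest of the answer addresses.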
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK] ?
Ok, so you realise that your design is not exactly robust.
That is a Normalisation question. It cannot be answered on just that pair of tables, because:
Normalisation is an overall issue, all the tables need to be taken into account, together, in the one exercise.
That exercise determines Keys. As the PKs change, the FKs in the child tables will change.
The structure you have detailed is a Record Filing System, not a set of Relational tables. It is full of duplication, and confusion (Facts1 are not clearly defined).
You appear to be making the classic mistake of stamping an ID field on every file. That (a) cripples the modelling exercise (hence the difficulties you are experiencing) and (b) guarantees a RFS instead of a RDb.
Solution
First, let me say that the level of detail in an answer is constrained to the level of detail given in the question. In this case, since you have provided great detail, I am able to make reasonable decisions about your data.
If I may, it is easier to correct the entire lot of them than to discuss and correct one or the other pair of files.
Various files need to be Normalised ("merged" or separated)
Various duplicate fields need to be Normalised (located with the relevant Facts, such that duplication is eliminated)
Various Facts1 need to be clarified and established properly.
Please consider this:
Reservation TRD
That is an IDEF1X model, rendered at the Table-Relation level. IDEF1X is the Standard for modelling Relational Databases. Please be advised that every little tick; notch; and mark; the crow's feet; the solid vs dashed lines; the square vs round corners; means something very specific and important. Refer to the IDEF1X Notation. If you do not understand the Notation, you will not be able to understand or work the model.
The Predicates are very important, I have given them for you.
If you would like information on the important Relational concept of Predicates, and how it is used to both understand and verify the model, as well as to describe it in business terms, visit this Answer, scroll down (way down) until you find the Predicate section, and read that carefully.
Assumption
I have made the following assumptions:
Given that it is 2015, when reserving a Room, the hotel requires Credit Card details. It forms the basis for a Reservation.
Rooms exist independently. RoomId is silly, given that all Rooms are already uniquely Identified by a RoomNo. The PK is ( RoomNo ).
Clients exist independently.
The real Identifier has to be (NameLast, NameFirst, Initial ... ), plus possibly StateCode. Otherwise you will have duplicate rows which are not permitted in a Relational Database.
However, that Key is too wide to be migrated into the child tables 2, so we add 3 a surrogate ( ClientId ), make that the PK, and demote the real Identifier to an AK.
CreditCards belong to Clients, and you want them Identified just once (not on each transaction). The PK is ( ClientId, CreditCardNo ).
Reservations are for Rooms, they do not exist in isolation, independently. Therefore Reservation is a child of Room, and the PK is ( RoomNo, Date ). You can use DateTime if the rooms are not for full days, if they are for short meetings, liaisons, etc.
A Reservation may, or may not, progress to be filled. The PK is identical to the parent. This allows just one filled reservation per Reservation.
Payments do not exist in isolation either. The Payments are only for Reservations.
The Payment may be for a ReservationFee (for "no shows"), or for a filled Reservation, plus extras. I will leave it to you to work out duration changes; etc. Multiple Payments (against a Reservation) are supported.
The PK is the Identifier of the parent, Reservation, plus a sequence number: ( RoomNo, Date, SequenceNo ).
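One possible DDL rendering of the Keys described above, as a sketch only (in-memory SQLite; column names and types are guesses, and the real Identifier for Client is abbreviated relative to the model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical DDL for the Keys described in the assumptions.
CREATE TABLE room (
    roomNo INTEGER PRIMARY KEY
);
CREATE TABLE client (
    clientId  INTEGER PRIMARY KEY,          -- surrogate, the one ID column
    nameLast  TEXT NOT NULL,
    nameFirst TEXT NOT NULL,
    UNIQUE (nameLast, nameFirst)            -- the real Identifier, demoted to an AK
);
CREATE TABLE reservation (
    roomNo   INTEGER NOT NULL REFERENCES room(roomNo),
    date     TEXT NOT NULL,
    clientId INTEGER NOT NULL REFERENCES client(clientId),
    PRIMARY KEY (roomNo, date)              -- Reservation is a child of Room
);
CREATE TABLE payment (
    roomNo     INTEGER NOT NULL,
    date       TEXT NOT NULL,
    sequenceNo INTEGER NOT NULL,
    amount     REAL NOT NULL,
    PRIMARY KEY (roomNo, date, sequenceNo), -- the parent Key plus a sequence
    FOREIGN KEY (roomNo, date) REFERENCES reservation(roomNo, date)
);
""")
```

Note how the compound Key ( RoomNo, Date ) migrates from Reservation into Payment, which is exactly the Key migration the IDEF1X model depicts.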
Relational Database
You now have a Relational Database, with levels of (a) Integrity (b) Power and (c) Speed, each of which is way, way, beyond the capabilities of a Record Filing System. Notice, there is just one ID column.
Note
1. A Database is a collection of Facts about the real world, limited to the scope that the app engages.
2. Which is the single reason that justifies the use of a surrogate.
3. A surrogate is always an addition, not a substitution. The real Keys that make the row unique cannot be abandoned.
Please feel free to ask questions or comment.

Can I have one million tables in my database?

Would there be any advantages/disadvantages to having one million tables in my database?
I am trying to implement comments. So far, I can think of two ways to do this:
1. Have all comments from all posts in 1 table.
2. Have a separate table for each post and store all comments from that post in its respective table.
Which one would be better?
Thanks
You're better off having one table for comments, with a field that identifies which post id each comment belongs to. It will be a lot easier to write queries to get comments for a given post id if you do this, as you won't first need to dynamically determine the name of the table you're looking in.
I can only speak for MySQL here (not sure how this works in Postgresql) but make sure you add an index on the post id field so the queries run quickly.
You can have a million tables but this might not be ideal for a number of reasons[*]. Classical RDBMS are typically deployed & optimised for storing millions/billions of rows in hundreds/thousands of tables.
As for the problem you're trying to solve, as others state, use foreign keys to relate a pair of tables: posts & comments a la [MySQL syntax]:
create table post(id integer primary key, post text);
create table comment(id integer primary key, postid integer, comment text, key fk (postid));
{you can add constraints to enforce referential integrity between comment and posts to avoid orphaned comments but this requires certain capabilities of the storage engine to be effective}
The generation of primary key IDs is left to the reader, but something as simple as auto increment might give you a quick start [http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html].
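A self-contained sketch of the pair of tables above in action, using in-memory SQLite (whose INTEGER PRIMARY KEY auto-generates ids much like MySQL's AUTO_INCREMENT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, post TEXT);
CREATE TABLE comment (
    id      INTEGER PRIMARY KEY,            -- auto-generated id
    postid  INTEGER NOT NULL REFERENCES post(id),
    comment TEXT
);
CREATE INDEX fk ON comment(postid);         -- keeps per-post lookups fast
""")
conn.execute("INSERT INTO post (post) VALUES ('hello world')")
post_id = conn.execute("SELECT id FROM post").fetchone()[0]
conn.execute("INSERT INTO comment (postid, comment) VALUES (?, 'first!')", (post_id,))
comments = conn.execute(
    "SELECT comment FROM comment WHERE postid = ?", (post_id,)
).fetchall()
print(comments)  # [('first!',)]
```

One query, one index, no dynamic table names: the same SELECT serves every post.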
Which is better?
Unless this is a homework assignment, storing this kind of material in a classic RDBMS might not fit with contemporary idioms. Keep the same spiritual schema, and use something like SOLR/Elasticsearch to store your material and benefit from the content indexing, since I trust that you'll want to avoid writing your own search engine. You can use something like sphinx [http://sphinxsearch.com] to index MySQL in a similar manner.
[*] Without some unconventional structuring of your schema, the amount of metadata and pressure on the underlying filesystem will be problematic (for example some dated/legacy storage engines, like MyISAM on MySQL will create three files per table).
When working with relational databases, you have to understand (a little bit about) normalization. The third normal form (3NF) is easy to understand and works in almost any case. A short tutorial can be found here. Use Google if you need more/other/better examples.
One table per record is a red flag; you know you're missing something. It also means you need dynamic DDL: you must create new tables when you have new records. This is also a security issue; the database user needs too many permissions and becomes a security risk.