SQL Server 2008: can 2 tables have the same composite primary key? - sql-server-2008

In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId)?
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK]

In this case, tables Reserve_details and Payment_details; can the 2 tables have the same composite primary key (clientId, roomId) ?
Yes, you can, it happens fairly often in Relational Databases.
(You have not set that tag, but since (a) you are using SQL Server, and (b) you have compound Keys, which indicates a movement in the direction of a Relational Database, I am making that assumption.)
Whether you should or not, in any particular instance, is a separate matter. And that gets into design; modelling; Normalisation.
Or should I merge the 2 tables so they become one:
clientId[PK], roomId[PK], reserveId[FK], paymentId[FK] ?
Ok, so you realise that your design is not exactly robust.
That is a Normalisation question. It cannot be answered on just that pair of tables, because:
Normalisation is an overall issue, all the tables need to be taken into account, together, in the one exercise.
That exercise determines Keys. As the PKs change, the FKs in the child tables will change.
The structure you have detailed is a Record Filing System, not a set of Relational tables. It is full of duplication, and confusion (Facts1 are not clearly defined).
You appear to be making the classic mistake of stamping an ID field on every file. That (a) cripples the modelling exercise (hence the difficulties you are experiencing) and (b) guarantees a RFS instead of a RDb.
Solution
First, let me say that the level of detail in an answer is constrained to the level of detail given in the question. In this case, since you have provided great detail, I am able to make reasonable decisions about your data.
If I may, it is easier the correct the entire lot of them, than to discuss and correct one or the other pair of files.
Various files need to be Normalised ("merged" or separated)
Various duplicates fields need to be Normalised (located with the relevant Facts, such that duplication is eliminated)
Various Facts1 need to be clarified and established properly.
Please consider this:
Reservation TRD
That is an IDEF1X model, rendered at the Table-Relation level. IDEF1X is the Standard for modelling Relational Databases. Please be advised that every little tick; notch; and mark; the crows feet; the solid vs dashed lines; the square vs round corners; means something very specific and important. Refer to the IDEF1X Notation. If you do not understand the Notation, you will not be able to understand or work the model.
The Predicates are very important, I have given them for you.
If you would like to information on the important Relational concept of Predicates, and how it is used to both understand and verify the model, as well as to describe it in business terms, visit this Answer, scroll down (way down) until you find the Predicate section, and read that carefully.
Assumption
I have made the following assumptions:
Given that it is 2015, when reserving a Room, the hotel requires Credit Card details. It forms the basis for a Reservation.
Rooms exist independently. RoomId is silly, given that all Rooms are already uniquely Identified by a RoomNo. The PK is ( RoomNo ).
Clients exist independently.
The real Identifier has to be (NameLast, NameFirst, Initial ... ), plus possibly StateCode. Otherwise you will have duplicate rows which are not permitted in a Relational Database.
However, that Key is too wide to be migrated into the child tables 2, so we add 3 a surrogate ( ClientId ), make that the PK, and demote the real Identifier to an AK.
CreditCards belong to Clients, and you want them Identified just once (not on each transaction). The PK is ( ClientId, CreditCardNo ).
Reservations are for Rooms, they do not exist in isolation, independently. Therefore Reservation is a child of Room, and the PK is ( RoomNo, Date ). You can use DateTime if the rooms are not for full days, if they are for short meetings, liaisons, etc.
A Reservation may, or may not, progress to be filled. The PK is identical to the parent. This allows just one filled reservation per Reservation.
Payments do not exist in isolation either. The Payments are only for Reservations.
The Payment may be for a ReservationFee (for "no shows"), or for a filled Reservation, plus extras. I will leave it to you to work out duration changes; etc. Multiple Payments (against a Reservation) are supported.
The PK is the Identifier of the parent, Reservation, plus a sequence number: ( RoomNo, Date, SequenceNo ).
Relational Database
You now have a Relational Database, with levels of (a) Integrity (b) Power and (c) Speed, each of which is way, way, beyond the capabilities of a Record Filing System. Notice, there is just one ID column.
Note
A Database is a collection of Facts about the real world, limited to the scope that the app engages.
Which is the single reason that justifies the use of a surrogate.
A surrogate is always an addition, not a substitution. The real Keys that make the row unique cannot be abandoned.
Please feel free to ask questions or comment.

Related

How to design table with primary key, index, unique in SQL [duplicate]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departents d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
Surrogate key will NEVER have a reason to change. I cannot say the same about the natural keys. Last names, emails, ISBN nubmers - they all can change one day.
Surrogate keys (typically integers) have the added-value of making your table relations faster, and more economic in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, that do change now and then).
A table's primary key should be used for identifying uniquely the row, mainly for join purposes. Think a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company id are up to 16 characters. You need to enlarge the company id primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc), adding a Country field in the primary key (also in the foreign keys). Ouch! Civil war in Freedonia, it's split in three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one to many relationship chains, the logical key chains. So for example, Organizations have many Accounts and Accounts have many Invoices. So the logical-key of Organization is OrgName. The logical-key of Accounts is OrgName, AccountID. The logical-key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column. It only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could Query the Organization table directly.
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
I want to share my experience with you on this endless war :D on natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns which mean:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one-day you need to move your data from one schema to another (It happens quite regularly in my company at least) then you might encounter Id collision problems. And Yes I know that you can use UUIDs but those lasts requires 32 hexadecimal digits! (If you care about database size then it can be an issue).
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so - as a developer - you have to put attention to the following points:
You must cycle your sequence ( when the max-value is reached it goes back to 1,2,...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (column with Id 1 might be newer than row with Id max-value - 1).
Make sure that your code (and even your client interfaces which should not happen as it supposed to be an internal Id) supports 32b/64b integers that you used to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less inefficient than surrogate keys. No! It depends on the used database engine:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Alway use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However in 'Patterns of Enterprise Archtecture' [Fowler] it has a good explanation of why you shouldn't use anything other than a key with no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using uids for surrogate keys when suitable. The major win with them is that you know the key in advance e.g. you can create an instance of a class with the ID already set and guaranteed to be unique whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars - on first glance, you might think that the licence plate could be used as the key. But these could be changed so that'd be a bad idea. You wouldnt really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single column, surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique contraints or indexes. This will keep you data integrity intact.
Business logic/natural keys can change, but the phisical key of a table should NEVER change.
Case 1: Your table is a lookup table with less than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key) NAME DESCRIPTION
PRG PROGRAMMER A programmer is writing code
MNG MANAGER A manager is doing whatever
CLN CLEANER A cleaner cleans
...............
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
On a datawarehouse scenario I believe is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder it is not good practice to place clustered indices on random surrogate keys i.e. GUIDs that read XY8D7-DFD8S, as they SQL Server has no ability to physically sort these data. You should instead place unique indices on these data, though it may be also beneficial to simply run SQL profiler for the main table operations and then place those data into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of point in time database it is best to have combination of surrogate and natural keys. e.g. you need to track a member information for a club. Some attributes of a member never change. e.g Date of Birth but name can change.
So create a Member table with a member_id surrogate key and have a column for DOB.
Create another table called person name and have columns for member_id, member_fname, member_lname, date_updated. In this table the natural key would be member_id + date_updated.
Horse for courses. To state my bias; I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins didn't affect the application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read, and longer to type and check: see #Tony Andrews post. But if you use an ORM or any other SQL-generation framework you won't notice it. Touch-typing also mitigates.
Maybe not completely relevant to this topic, but a headache I have dealing with surrogate keys. Oracle pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, anytime they (dimensions) need to be reloaded as new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update makes the SKs out of sync with the original values stored to the fact, forcing a complete reload of all fact tables that join to it. I would prefer that even if the SK was a meaningless number, there would be some way that it could not change for original/old records. As many know, out-of-the box rarely serves an organization's needs, and we have to customize constantly. We now have 3yrs worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, they are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.

Implementing efficient foreign keys in a relational database

All popular SQL databases, that I am aware of, implement foreign keys efficiently by indexing them.
Assuming a N:1 relationship Student -> School, the school id is stored in the student table with a (sometimes optional) index. For a given student you can find their school just looking up the school id in the row, and for a given school you can find its students by looking up the school id in the index over the foreign key in Students. Relational databases 101.
But is that the only sensible implementation? Imagine you are the database implementer, and instead of using a btree index on the foreign key column, you add an (invisible to the user) set on the row at the other (many) end of the relation. So instead of indexing the school id column in students, you had an invisible column that was a set of student ids on the school row itself. Then fetching the students for a given school is a simple as iterating the set. Is there a reason this implementation is uncommon? Are there some queries that can't be supported efficiently this way? The two approaches seem more or less equivalent, modulo particular implementation details. It seems to me you could emulate either solution with the other.
In my opinion it's conceptually the same as splitting of the btree, which contains sorted runs of (school_id, student_row_id), and storing each run on the school row itself. Looking up a school id in the school primary key gives you the run of student ids, the same as looking up a school id in the foreign key index would have.
edited for clarity
You seem to be suggesting storing "comma separated list of values" as a string in a character column of a table. And you say that it's "as simple as iterating the set".
But in a relational database, it turns out that "iterating the set" when its stored as list of values in a column is not at all simple. Nor is it efficient. Nor does it conform to the relational model.
Consider the operations required when a member needs to be added to a set, or removed from the set, or even just determining whether a member is in a set. Consider the operations that would be required to enforce integrity, to verify that every member in that "comma separated list" is valid. The relational database engine is not going to help us out with that, we'll have to code all of that ourselves.
At first blush, this idea may seem like a good approach. And it's entirely possible to do, and to get some code working. But once we move beyond the trivial demonstration, into the realm of real problems and real world data volumes, it turns out to be a really, really bad idea.
The storing comma separated lists is all-too-familiar SQL anti-pattern.
I strongly recommend Chapter 2 of Bill Karwin's excellent book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming ISBN-13: 978-1934356555
(The discussion here relates to "relational database" and how it is designed to operate, following the relational model, the theory developed by Ted Codd and Chris Date.)
"All nonkey columns are dependent on the key, the whole key, and nothing but the key. So help me Codd."
Q: Is there a reason this implementation is uncommon?
Yes, it's uncommon because it flies in the face of relational theory. And it makes what would be a straightforward problem (for the relational model) into a confusing jumble that the relational database can't help us with. If what we're storing is just a string of characters, and the database never needs to do anything with that, other than store the string and retrieve the string, we'd be good. But we can't ask the database to decipher that as representing relationships between entities.
Q: Are there some queries that can't be supported efficiently this way?
Any query that would need to turn that "list of values" into a set of rows to be returned would be inefficient. Any query that would need to identify a "list of values" containing a particular value would be inefficient. And operations to insert or remove a value from the "list of values" would be inefficient.
This might buy you some small benefit in a narrow set of cases. But the drawbacks are numerous.
Such indices are useful for more than just direct joins from the parent record. A query might GROUP BY the FK column, or join it to a temp table / subquery / CTE; all of these cases might benefit from the presence of an index, but none of the queries involve the parent table.
Even direct joins from the parent often involve additional constraints on the child table. Consequently, indices defined on child tables commonly include other fields in addition to the key itself.
Even if there appear to be fewer steps involved in this algorithm, that does not necessarily equate to better performance. Databases don't read from disk a column at a time; they typically load data in fixed-size blocks. As a result, storing this information in a contiguous structure may allow it to be accessed far more efficiently than scattering it across multiple tuples.
No database that I'm aware of can inline an arbitrarily large column; either you'd have a hard limit of a few thousand children, or you'd have to push this list to some out-of-line storage (and with this extra level of indirection, you've probably lost any benefit over an index lookup).
Databases are not designed for partial reads or in-place edits of a column value. You would need to fetch the entire list whenever it's accessed, and more importantly, replace the entire list whenever it's modified.
In fact, you'd need to duplicate the entire row whenever the child list changes; the MVCC model handles concurrent modifications by maintaining multiple versions of a record. And not only are you spawning more versions of the record, but each version holds its own copy of the child list.
Probably most damning is the fact that an insert on the child table now triggers an update of the parent. This involves locking the parent record, meaning that concurrent child inserts or deletes are no longer allowed.
I could go on. There might be mitigating factors or obvious solutions in many of these cases (not to mention outright misconceptions on my part), though there are probably just as many issues that I've overlooked. In any case, I'm satisfied that they've thought this through fairly well...

Not defined database schema

I have a mysql database with 220 tables. The database is will structured but without any clear relations. I want to find a way to connect the primary key of each table to its correspondent foreign key.
I was thinking to write a script to discover the possible relation between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and close to the solution. Also, If there's any available tool which do that.
Please Advice!
Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, that it out-lives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity
Pretend SQL Platform
Evidently non-compliant databases such as MySQL, PostgreSQL and Oracle fraudulently position themselves as "SQL", but they do not have basic SQL functionality, such as a catalogue. I suppose you get what you pay for.
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
Solution 2
Now you could do all that in SQL, but then, the code would be horrendous, and SQL is not designed for that (table comparisons). Which is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straight-forward. That is squarely within awks design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see a record number 123456 being repeated in the Invoice file, now what other file does this relate to ? Every other possible file, Supplier, Customer, Part, Address, CreditCard, where it occurs once only, has a record number 123456 !
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table, now what other table does this relate to ? The only table that has IBM occurring once is the Customer table.
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs

Spreading/distributing an entity into multiple tables instead of a single on

Why would anyone distribute an entity (for example user) into multiple tables by doing something like:
user(user_id, username)
user_tel(user_id, tel_no)
user_addr(user_id, addr)
user_details(user_id, details)
Is there any speed-up bonus you get from this DB design? It's highly counter-intuitive, because it would seem that performing chained joins to retrieve data sounds immeasurably worse than using select projection..
Of course, if one performs other queries by making use only of the user_id and username, that's a speed-up, but is it worth it? So, where is the real advantage and what could be a compatible working scenario that's fit for such a DB design strategy?
LATER EDIT: in the details of this post, please assume a complete, unique entity, whose attributes do not vary in quantity (e.g. a car has only one color, not two, a user has only one username/social sec number/matriculation number/home address/email/etc.. that is, we're not dealing with a one to many relation, but with a 1-to-1, completely consistent description of an entity. In the example above, this is just the case where a single table has been "split" into as many tables as non-primary key columns it had.
By splitting the user in this way you have exactly 1 row in user per user, which links to 0-n rows each in user_tel, user_details, user_addr
This in turn means that these can be considered optional, and/or each user may have more than one telephone number linked to them. All in all it's a more adaptable solution than hardcoding it so that users always have up to 1 address, up to 1 telephone number.
The alternative method is to have i.e. user.telephone1 user.telephone2 etc., however this methodology goes against 3NF ( http://en.wikipedia.org/wiki/Third_normal_form ) - essentially you are introducing a lot of columns to store the same piece of information
edit
Based on the additional edit from OP, assuming that each user will have precisely 0 or 1 of each tel, address, details, and NEVER any more, then storing those pieces of information in separate tables is overkill. It would be more sensible to store within a single user table with columns user_id, username, tel_no, addr, details.
If memory serves this is perfectly fine within 3NF though. You stated this is not about normal form, however if each piece of data is considered directly related to that specific user then it is fine to have it within the table.
If you later expanded the table to have telephone1, telephone2 (for example) then that would violate 1NF. If you have duplicate fields (i.e. multiple users share an address, which is entirely plausible), then that violates 2NF which in turn violates 3NF
This point about violating 2NF may well be why someone has done this.
The author of this design perhaps thought that storing NULLs could be achieved more efficiently in the "sparse" structure like this, than it would "in-line" in the single table. The idea was probably to store rows such as (1 , "john", NULL, NULL, NULL) just as (1 , "john") in the user table and no rows at all in other tables. For this to work, NULLs must greatly outnumber non-NULLs (and must be "mixed" in just the right way), otherwise this design quickly becomes more expensive.
Also, this could be somewhat beneficial if you'll constantly SELECT single columns. By splitting columns into separate tables, you are making them "narrower" from the storage perspective and lower the I/O in this specific case (but not in general).
The problems of this design, in my opinion, far outweigh these benefits.

For storing people in MySQL (or any DB) - multiple tables or just one?

Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to say 250.000 people (on MySQL and ISAM) will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For, e.g. a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term) or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you - a customer because the bought something and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many to many relationship.
create person_type (
person_id int unsigned,
person_type_id int unsigned,
date_started datetime,
date_ended datetime,
[ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
Since you have many different "types" of Persons, in order to have normalized design, with proper Foreign Key constraints, it's better to use the supertype/subtype pattern. One Person table (with the common to all attributes) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, and with necessary details for every type of Person.
Check this answer by #Branko for an example: Many-to-Many but sourced from multiple tables
250.000 records for a database is not very much. If you set your indexes appropriately you will never find any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it an TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me
Theoretically it would be possible to be a customer for the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However like Topener said, 250.000 isn't much. So I would personally feel safe to store every single person in one table.
And then have a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.