Most efficient implementation of a 1-to-n map in MySQL [duplicate]

This question already has answers here: MySQL storing undirected graph edges efficiently
I have a database of information organized by US zipcode. I am building an algorithm that crawls along adjacent zipcodes to determine the size of a 'city' based on density, job characteristics, or whatever. I use the location and area of any zip code to estimate which other zipcodes are adjacent to it. I have become aware that this algorithm eats up most of my processing time when I run tests of my program.
So what I want to do is have the map (as in a data structure map) of adjacent zipcodes in a table in my database.
My current implementation is that I have a table with two fields, source and target. Each time my algorithm determines two zipcodes are adjacent, the two codes are inserted both ways into the table, like so:
+--------+--------+
| source | target |
+--------+--------+
| 02139  | 02138  |
| 02138  | 02139  |
+--------+--------+
That way I can search for all adjacent zip codes with
SELECT target FROM adjacent WHERE source = '02139';
and get all the zip codes that are adjacent to '02139'.
Now strictly speaking, my implementation is just fine. For a set of fewer than 50,000 total zip codes, doing it this way doesn't really impose any computational penalties. However, not being indexed, and having each relationship inserted twice, seems non-optimal, and since I'm just doing this for funzies and learning, I should put in the effort to optimize. So I'm trying to find out how to more efficiently simulate a mapping using a MySQL table.
So the question is: what is the most efficient way to represent 1-to-n mappings using MySQL?

In your application, the concept of adjacency seems to be bidirectional (aka symmetric). That is,
A adj B if and only if B adj A
So you can consider "canonicalizing" the relationship and then always store the zip with a smaller numerical value in the first column and the one with the larger numerical value in the second column. That is, using your example, you now only have one row:
+----------+-----------+
| zipLower | zipHigher |
+----------+-----------+
| 02138    | 02139     |
+----------+-----------+
And then when you need to find all the neighboring zips of, say, 02139, your query may look like this (assuming the new table is called adjHigher):
SELECT zipHigher AS zip
FROM adjHigher
WHERE zipLower = '02139'
UNION
SELECT zipLower AS zip
FROM adjHigher
WHERE zipHigher = '02139';
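For concreteness, here is a minimal sketch of the table itself; the table and column names come from this answer, while the key choices and the CHECK are my assumptions:

CREATE TABLE adjHigher (
    zipLower  CHAR(5) NOT NULL,
    zipHigher CHAR(5) NOT NULL,
    PRIMARY KEY (zipLower, zipHigher),  -- serves the first branch of the UNION
    KEY idx_zipHigher (zipHigher),      -- serves the second branch
    CHECK (zipLower < zipHigher)        -- enforces the canonical ordering
                                        -- (parsed but not enforced before MySQL 8.0.16)
);

With both the primary key and the secondary index in place, each branch of the UNION is a simple index lookup.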
Pros and cons
Is this really a more optimal design? It depends. This design uses half the storage space, and insertion may be more efficient (only one row, not two, needs to be inserted per adjacency relationship). However, you can also see that the lookup query becomes more complicated. If you have to JOIN this table with other tables, this design may make your JOINs more complicated too. I guess the intention of this discussion is to explore different design options before committing to one. So here it is.

Related

EAV vs null vs Mixed

I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy during this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how to model the product information?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every possible column and working with this monster table.
Pro:
Easy queries
Easy layout
Con:
Lots of NULL values
The actual code becomes sensitive to the query (different types require different columns)
EAV-Pattern
Obviously the EAV pattern can provide a nicer solution for this. However, I've been working with EAV in the past, and when it comes down to performance, it can become a problem for a huge number of entries.
Searching is easy, but listing a "normalized table" requires one join per actual column -> slow (see the sketch after this list).
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
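To illustrate the listing problem (all table and column names here are hypothetical), rebuilding one product row with just three attributes already takes three joins:

SELECT p.id,
       a_color.value  AS color,
       a_size.value   AS size,
       a_weight.value AS weight
FROM products p
LEFT JOIN attributes a_color  ON a_color.product_id  = p.id AND a_color.name  = 'color'
LEFT JOIN attributes a_size   ON a_size.product_id   = p.id AND a_size.name   = 'size'
LEFT JOIN attributes a_weight ON a_weight.product_id = p.id AND a_weight.name = 'weight';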
Single Table per category
Basically the opposite of the EAV pattern: create one table per product type, e.g. "cats", "dogs", "cars", ...
While this might be possible for a countable number of categories, it becomes a nightmare for a steadily growing number of categories, if you have to maintain those.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet, I found recommendations to mix both approaches: use a single table for the common information, while grouping other attributes into "attribute groups" which are organized in the EAV fashion.
However, I think this would basically import the drawbacks of EACH approach... You need to work with regular tables (basic information) and do a huge number of joins to get ALL the information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems hard(er) to query and to work with than a regular database layout.
Automating single tables
Another idea was to automate the creation of tables per category (and therefore automate the queries on those), while maintaining a "master table" containing just the id and the category information, in order to get the best performance for an undetermined number of tables...?
i.e.:
Products
id | category | actualId
1  | cat      | 1
2  | car      | 1

cats
id | color | mew
1  | white | true

cars
id | wheels | bhp
1  | 4      | 123
The (abstract) Products table would allow querying for everything, while details are available via an easy join on "actualId" and the responsible table.
However, this would lead to problems if you want to run a "show all" query, because this is not solvable by SQL alone, since the table name (in the join) needs to be explicit in the query.
What other options are available? There are a lot of "webshops", each dealing with this problem more or less - how do they solve it in an efficient way?
I strongly disagree with your opinion that the "monster" table approach leads to "Easy queries", and that the EAV approach will cause performance issues (premature optimization?). And it doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.`key`, '[', ext.`type`, ']', ext.`value`))
FROM base_attributes base
LEFT JOIN extended_attributes ext
       ON base.id = ext.id
WHERE base.id = ?
;
You would need to do some parsing on the above, but a wee bit of polishing would give you something parseable as JSON or XML, without putting your data inside anonymous blobs.
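For reference, a minimal sketch of the two tables the query above assumes; the datatypes are my guesses, and the key/type/value columns need backticks since KEY is a reserved word:

CREATE TABLE base_attributes (
    id               INT PRIMARY KEY,
    other_attributes VARCHAR(255)
);

CREATE TABLE extended_attributes (
    id      INT          NOT NULL,     -- references base_attributes.id
    `key`   VARCHAR(64)  NOT NULL,     -- attribute name
    `type`  VARCHAR(16)  NOT NULL,     -- declared type of the value
    `value` VARCHAR(255) NOT NULL,     -- the value itself, as text
    PRIMARY KEY (id, `key`),
    FOREIGN KEY (id) REFERENCES base_attributes (id)
);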
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).

Hierarchical Query Advantages

I have some huge database tables filled with scientific names, in a parent-child relationship, like this...
TAXON       | PARENT
Mammalia    | Chordata
Carnivora   | Mammalia
Canidae     | Carnivora
Canis       | Canidae
Canis-lupus | Canis
I installed PostgreSQL and started working on a hierarchical query, but it's far more complex than I thought. So I'm thinking of sticking with MySQL and going back to my original scheme, which looks like this:
TAXON       | PARENT    | FAMILY  | ORDER
Mammalia    | Chordata  | (NULL)  | (NULL)
Carnivora   | Mammalia  | (NULL)  | Carnivora
Canidae     | Carnivora | Canidae | Carnivora
Canis       | Canidae   | Canidae | Carnivora
Canis-lupus | Canis     | Canidae | Carnivora
It looks amateurish, but I was surprised to discover that the Catalogue of Life apparently uses the same scheme, with more columns and over a million rows.
With this scheme, I can count children and grandchildren by simply counting the rows where FAMILY = 'Canidae', for example. And I can use a series of "stairstep" queries to figure out the names of the great-grandparents, etc.
So I wondered what the benefits of hierarchical queries are. They're more elegant, and you can presumably do everything with just one or two queries, rather than a series of queries. I also assume they're faster, though my original query, with the two extra fields, is fast enough.
Do hierarchical queries have some additional significant advantage that would justify me hiring someone to set one up, or is it primarily just a matter of speed?
A recursive / hierarchical query is often actually slower. It varies - there are many more rows, but on the other hand each row is much smaller.
The main advantage is flexibility, rather than performance. In your table you have a set number of columns... but what if there was any number of possible steps between ultimate parent (root) and ultimate child (leaf)? Or branches that join as well as open, so that one object has two parents? That's when hierarchical queries become more useful.
If by hierarchical queries you mean PostgreSQL Common Table Expressions, the answer is that they are a wonderful feature that allows you to write more readable queries and in some (but not all) cases leads to improved performance.
Is it really worth hiring someone to install PostgreSQL for you? Maybe, maybe not. It's hard to say without benchmarks.
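For what it's worth, here is a sketch of such a query against the TAXON/PARENT table from the question (the table name taxa is assumed); it runs on PostgreSQL, and MySQL supports the same syntax as of 8.0:

WITH RECURSIVE lineage AS (
    -- anchor: the taxon we start from
    SELECT taxon, parent
    FROM taxa
    WHERE taxon = 'Canis-lupus'
    UNION ALL
    -- recursive step: climb to each successive parent
    SELECT t.taxon, t.parent
    FROM taxa t
    JOIN lineage l ON t.taxon = l.parent
)
SELECT taxon FROM lineage;

One statement walks the whole ancestor chain (Canis-lupus, Canis, Canidae, Carnivora, Mammalia), replacing the series of "stairstep" queries.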
What you really ought to try is Modified Preorder Tree Traversal. Now, that sounds complicated, but it isn't:
We’ll start by laying out our tree in a horizontal way. Start at the root node (‘Food’), and write a 1 to its left. Follow the tree to ‘Fruit’ and write a 2 next to it. In this way, you walk (traverse) along the edges of the tree while writing a number on the left and right side of each node. The last number is written at the right side of the ‘Food’ node. In the article's image, you can see the whole numbered tree, and a few arrows to indicate the numbering order.
Here is another excellent article on it: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
This approach can be used in both PostgreSQL and MySQL, and the existing data can be converted without too much difficulty.
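For example, the classic descendants query in the style of that article (assuming the lft and rgt columns hold the left and right numbers written during the traversal):

SELECT node.name
FROM nested_category AS node,
     nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
  AND parent.name = 'Food'
ORDER BY node.lft;

A single non-recursive SELECT returns an entire subtree, which is the main attraction of the technique.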

How to reference groups of records in relational databases?

Humans
| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
| 1 | Isaac | Newton | M |
| 2 | Marie | Curie | F |
| 3 | Tim | Duncan | M |
Animals
| AnimalID | Species | NickName |
|----------+---------+----------|
| 4 | Tiger | Ronnie |
| 5 | Dog | Snoopy |
| 6 | Dog | Bear |
| 7 | Cat | Sleepy |
How do I reference a group of records in other tables?
For example:
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+---------|
| 8 | Rice | ??? |
What I want to store in EatenBy may be:
a single record in the Humans and Animals tables (e.g. Tim Duncan)
a group of records in a table (e.g. all dogs, all males, all females)
a whole table (e.g. all humans)
A simple solution is to use a concatenated string, which includes primary keys
from different tables and special strings such as 'Humans' and 'M'.
The application could parse the concatenated string.
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+--------------|
| 8 | Rice | Humans, 6, 7 |
Using a concatenated string is a bad idea from the perspective of
relational database design.
Another option is to add another table and use a foreign key.
Foods
| FoodID | FoodName |
|--------+----------|
| 8 | Rice |
EatenBy
| FoodID | EatenBy |
|--------+---------|
| 8 | Humans |
| 8 | 6 |
| 8 | 7 |
It's better than the first solution. The problem is that the EatenBy field stores values of different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?
The example tables here are a bit contrived, but I do run into situations like
this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.
This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates. There is a series of matching Responses.
The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.
The Answer exceeds the maximum answer length, so I provide the later Responses as links in order to overcome that.
Response to Initial Question
You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 2NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one, you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of the 1970's ISAM data processing.
The problem is that you seem to want a relational database. Perhaps you have heard of the data integrity, the relational power (Join power for you, at this stage), and the speed. A Record Filing System has none of that.
If you want a Relational database, then you are going to have to:
think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).
Then classifying the data (no relation whatever to the OO class or classifier concept).
Then relating the classified data.
The second problem is, and this is typical of OO types, they concentrate on, obsess on, the data values, rather than on the meaning of the data; how it is classified; how it relates to other data; etc.
No question, you did not think that concept up yourself; your "teachers" fed it to you, I see it all the time. And they love the Record Filing Systems. Notice that instead of giving table definitions, you state that you give "structure", but you list data values.
In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.
I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.
Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.
The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".
Therefore, I would ask you to:
forget about "table structures" or definitions
forget about what you want
forget about implementation options
forget ID columns, completely and totally
forget EatenBy
think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it
think about how that data is classified, and how it can be classified.
how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)
If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:
Things
Groups of Things
Relationships between Things and ThingGroups
The Things are nouns, subjects. Eventually we will be doing something between those subjects, which will be verbs or action statements. That will form Predicates (First Order Logic). But not now; for now, we want only the Things.
Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.
Response to Update 1 re Hierarchy
Record IDs are Physical, Non-relational
If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.
Please read this Answer.
Hierarchies Exist in the Data
If you want a full discourse, please ask a new question. Here is a quick summary.
Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.
The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.
Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.
There are three types of Hierarchies that occur in data.
Hierarchy Formed in Sequence of Tables
This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database and produces a Record Filing System, with none of the integrity, power, or speed of a Relational Database.
The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.
The Hidders Response has a great example of Hierarchies:
a. that exist naturally in the data
b. that OO types are blind to [as Hidders evidently is]
c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.
Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.
The two hierarchies in those four tables are:
Domain::Animal::Harvest
Domain::Activity::Harvest
Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.
Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:
Subject::CategorySubject::ExaminationResult
Category::CategorySubject::ExaminationResult
Person::Registrant::Candidate::ExaminationResult
Note that the last one is a progression of the state of the business instrument, thus the Key does not compound.
Hierarchy of Rows within One Table
Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).
The terms ancestor and descendant used here are plain technical terms; they do not have the OO connotations and limitations.
You do need recursion in the server, in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.
Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:
Message::Message[Message]::Message[::Message[Message]] ...
Hierarchy of Rows within One Table, Via an Associative Table
This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.
The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.
Here is the Bill of Materials hierarchy, modelled and implemented correctly.
Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.
Person::Progeny is also given.
The hierarchies are:
Part[Assembly]::Part[Component] ...
Part[Component]::Part[Assembly] ...
Person[Parent]::Person[Child] ...
Person[Child]::Person[Parent] ...
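A minimal sketch of that Bill of Materials pair; the names and datatypes here are mine, not from the linked model:

CREATE TABLE Part (
    PartCode CHAR(12)    NOT NULL,
    Name     VARCHAR(60) NOT NULL,
    CONSTRAINT Part_PK PRIMARY KEY (PartCode)
);

-- Associative table: one row per (assembly, component) relationship
CREATE TABLE Assembly (
    AssemblyCode  CHAR(12) NOT NULL,  -- the containing Part
    ComponentCode CHAR(12) NOT NULL,  -- the contained Part
    CONSTRAINT Assembly_PK       PRIMARY KEY (AssemblyCode, ComponentCode),
    CONSTRAINT Assembly_Part_FK  FOREIGN KEY (AssemblyCode)  REFERENCES Part (PartCode),
    CONSTRAINT Component_Part_FK FOREIGN KEY (ComponentCode) REFERENCES Part (PartCode)
);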
Ignorance Of Hierarchy
Separate from the fact that hierarchies commonly exist in the data, that they are not recognised as such due to the suppression, and that therefore they are not implemented as hierarchies: when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.
Adjacency List
The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial of the fact that it is founded on the Hierarchical Model (which is plain evidence that they are ignorant of the basic concepts in the RM they allege to be postulating about). So they can't use the name. This is the stupid name they use.
Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc, absent of Relational Integrity, etc.
And they are clueless as to how to traverse the tree, that one needs recursion.
Nested Sets
An abortion, straight from hell. A Record Filing System within a Record Filing system. Not only does this generate masses of duplication and break Normalisation rules, this fixes the records in the filing system in concrete.
Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.
The MS HIERARCHYID Datatype does the same thing. Gives you a mass of concrete that has to be jack-hammered and poured again, every time a node changes.
Ok, it wasn't so short.
Response to Update 2
Response to Update 3
Response to Update 4
For each category that eats the food, you should add one table. For example, if a food may be eaten by a specific gender, you would have:
Food_Gender(FoodID,GenderID)
for humans you would have:
Food_Human(FoodID,HumanID)
for animals species:
Food_AnimalSpc(FoodID,Species)
for an entire table:
Food_Table(FoodID,TableID)
and so on for other categories.
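A sketch of one such junction table, with assumed datatypes:

CREATE TABLE Food_Human (
    FoodID  INT NOT NULL,
    HumanID INT NOT NULL,
    PRIMARY KEY (FoodID, HumanID),
    FOREIGN KEY (FoodID)  REFERENCES Foods  (FoodID),
    FOREIGN KEY (HumanID) REFERENCES Humans (HumanID)
);

The other categories follow the same shape, with the second column referencing the relevant table or attribute.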

How to store frequently modified lists in database in a natural way so that they are just ready to read?

For a social network site, I need to store frequently modified lists for each entity (and there are millions of such entities) which are:
frequently appended to
frequently read
sometimes reduced
lists are keyed by primary key
I'm already storing some other types of data in an RDBMS. I know that I could store those lists in an RDBMS as a many-to-many relationship, like this: create a table listItems with two columns, listId and listItem, and to generate any particular list, just do a SELECT query for all records WHERE listId = x. But storing lists this way in an RDBMS is not very ideal when high scalability is a concern. Instead I would like to store prepared lists in a natural way, so that retrieval performance is maximized, because I need to fetch around a hundred such lists for a user whenever a user logs in and views a page.
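For concreteness, a sketch of the table I mean (the types are incidental):

CREATE TABLE listItems (
    listId   BIGINT       NOT NULL,
    listItem VARCHAR(255) NOT NULL,
    PRIMARY KEY (listId, listItem)  -- the composite key doubles as the lookup index
);

SELECT listItem FROM listItems WHERE listId = ?;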
So how do I solve this? What kind of database should be used for this data, perhaps one that allows a variable number of columns keyed by a primary key, like Cassandra?
I used the same method, that is, storing a two-column row for every record, which I turned into a txt file of formatted HTML, which we then changed to JSON and finally to MongoDB.
But since you have frequent operations, I suggest Cassandra, HBase, and Google BigTable implementations like Accumulo, Cloudata, and Hypertable.
Cloudata may be the right one for you.
As you pointed out, the solution must be performant and scalable: I'd suggest using Redis with its LIST data structure and O(1) inserts and O(N) fetches (N = elements to fetch, assuming you're fetching the latest ones from the lists), and scaling it horizontally with some hashing algorithm. I don't know what amount of data you are going to store and how many machines are available, but it will definitely be the best choice performance-wise, since nothing beats memory access speed.
If the amount of data is huge and you can't keep it all in RAM then Cassandra can do the job - storing lists ordered by time is a nice fit for it even better with partition strategy as Zanson mentioned above.
One more thought: you said read performance must be maximal, and once a user logs in you will need to fetch hundreds of lists for this user. Why not prepare a single list for each user? That way there will be more writes, but reads will be optimized, since you will need to fetch only the latest entries from one list. I'm not sure if that fits your task, just a thought. :)
I would recommend SSDB (https://github.com/ideawu/ssdb), a network wrapper around Google's LevelDB. SSDB is designed to store collection data such as list, map, and zset (sorted set). You can use it like this:
ssdb->hset(listId, listItem1);
ssdb->hset(listId, listItem2);
ssdb->hset(listId, listItem3);
...
list = ssdb->hscan(listId, 100);
// now list = [listItem1, listItem2, listItem3, ...]
The number of items in one map is limited only by the size of the hard disk. Another solution is Redis, but Redis stores all data in memory (say, no more than 30GB), so it probably won't fit your project.
C++, PHP, Python, Java, Lua, and more clients are supported by SSDB.
Cassandra has native support for storing sets/maps/lists. If your queries will always be pulling the whole thing down, then they are a very easy way to deal with this type of thing.
http://www.datastax.com/dev/blog/cql3_collections
http://cassandra.apache.org/doc/cql3/CQL.html#collections
If your lists are tied to a user, you can make the different columns on the users row/partition, and then queries for the multiple lists will be fast, as they will all be in the same partition for a given user.
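A sketch in CQL3, per the collections documentation linked above; the table and column names are hypothetical:

CREATE TABLE user_lists (
    user_id   uuid PRIMARY KEY,
    followers list<text>,   -- one collection column per list
    favorites list<text>
);

-- append to a list
UPDATE user_lists SET followers = followers + ['alice']
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- remove an element by value
UPDATE user_lists SET followers = followers - ['alice']
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;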
Cassandra can be used very well for such use cases. Create as many column families as you want for the returned data sets/queries. Cassandra works best with denormalized data or sets like 1:m and m:m relations.
I know you didn't want to consider relational databases, but I think that for this simple situation there is also a scalable solution with a relational database. The main benefit would be that you don't need to maintain a separate database system.
To gain scalability, all NoSQL solutions will distribute your data across multiple nodes. You can do this in your application code, spreading your data out across multiple relational databases. To keep the load balanced, you may need to move data occasionally, but it may be sufficient to simply spawn a new database for every N lists.
In Cassandra you can have wide rows, up to 2 billion columns per row. If that's enough for an entity's cumulative list items, you can store the whole entity's lists in a single row and then retrieve them all together.
With Cassandra's "composite columns" you can store the elements of each list sequentially and ordered; you can delete a single column (a list item) when you want, and when you have an insertion you just need to insert a column.
something like this:
       | list_1_Id:item1Id | list_1_Id:item2Id | list_2_Id:item1Id | ... | list_n_Id:item3Id |
entity | item1Value        | item2Value        | item1Value        | ... | item3Value        |
So practically you deal with columns (= items) rather than lists, and it makes your work much easier.
Depending on your lists' size, consider splitting an entity's row into multiple rows...
something like this:
                 | item1Id    | item2Id    | item3Id    | item4Id    | ...
entiId_list_1_Id | item1Value | item2Value | item3Value | item4Value | ...

                 | item1Id    | item2Id    | item3Id    | item4Id    | ...
entiId_list_2_Id | item1Value | item2Value | item3Value | item4Value | ...
...
And you can put the itemValue in the column name and leave the column value empty to reduce the size...
For example, you can insert a new item by simply doing:
//columns are sorted by their id if they have any
insert into entityList[entityId][listId][itemId] = item value;
or
//columns are sorted by their value
insert into entityList[entityId][listId][itemvalue] = nothing;
and delete:
delete from entityList where entityId='d' and listId='o' and itemId='n';
Or, via your application, you can do it using a rich client like Hector...

Which of these MySQL database designs (attached) is best for mostly read high performance?

I am a database admin and developer in MySQL. I have been working with MySQL for a couple of years. I recently acquired and studied O'Reilly's High Performance MySQL, 2nd Edition, to improve my skills in MySQL's advanced features, high performance, and scalability, because I have often been frustrated by the lack of advanced MySQL knowledge I had (and, in big part, still have).
Currently, I am working on an ambitious web project. In this project, we will have quite a lot of content and users from the beginning. I am the designer of the database, and this database must be very fast (some inserts, but mostly and more importantly READS).
I want to discuss these requirements here:
There will be several kind of items
The items have some fields and relations in common
The items also have some fields and relations special that make them differents each other
Those items will have to be listed all together ordered or filtered by common fields or relations
The items will also have to be listed by type alone (for example, item_specialA)
I have some basic design doubts, and I would like you to help me decide and learn which design approach would be better for a high-performance MySQL database.
Classical approach
The following diagram shows the classical approach, which is the first you may think of with a database mindset: Database diagram
Centralized approach
But maybe we can improve it with a pseudo object-oriented paradigm, centralizing the common items and relations in one common item table. It would also be useful for listing all kinds of items: Database diagram
Advantages and disadvantages of each one?
Which aproach would you choose or which changes would you apply seeing the requirements seen before?
Thanks all in advance!!
What you have are two distinct data mapping strategies. The one you called "classical" is "one table per concrete class" in other sources, and the one you called "centralized" is "one table per class" (Mapping Objects to Relational Databases: O/R Mapping In Detail). They both have their advantages and disadvantages (follow the link above). Queries in the first strategy will be faster (you will need to join only 2 tables vs 3 in the second strategy).
I think that you should explore the classic supertype/subtype pattern. Here are some examples from SO.
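A minimal sketch of that supertype/subtype layout, with hypothetical names based on the question's item_specialA:

-- supertype: the fields and relations common to all items
CREATE TABLE item (
    item_id   INT AUTO_INCREMENT PRIMARY KEY,
    item_type ENUM('specialA', 'specialB') NOT NULL,
    title     VARCHAR(100) NOT NULL,
    created   DATETIME     NOT NULL,
    KEY idx_type (item_type)       -- supports listings filtered by type
);

-- subtype: the fields specific to item_specialA, 1:1 with the supertype row
CREATE TABLE item_specialA (
    item_id INT PRIMARY KEY,
    extraA  VARCHAR(100),
    FOREIGN KEY (item_id) REFERENCES item (item_id)
);

Listing all items together only touches item; listing one type joins exactly one subtype table.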
If you're looking mostly for speed, consider selective use of MyISAM tables, use a centralized "object" table, and just one additional table with correct indexes, in this form:
object_type | object_id | property_name | property_value
user        | 1         | photos        | true
city        | 2         | photos        | true
user        | 5         | single        | true
city        | 2         | metro         | true
city        | 3         | population    | 135000
and so on... Lookups on primary keys or indexed keys (object_type, object_id, property_name), for example, will be blazing fast. Also, you reduce the need to end up with 457 tables as new properties appear.
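Declared as a table, that might look like this (a sketch; the datatypes are assumed):

CREATE TABLE object_property (
    object_type    VARCHAR(20)  NOT NULL,
    object_id      INT          NOT NULL,
    property_name  VARCHAR(40)  NOT NULL,
    property_value VARCHAR(255),
    PRIMARY KEY (object_type, object_id, property_name)  -- the index the fast lookups rely on
) ENGINE=MyISAM;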
It isn't exactly a well-designed or perfectly normalized database, and if you are looking at a long-term big site, you should consider caching, or at least a denormalized paradigm: denormalized MySQL tables like this one, Redis, or maybe MongoDB.