I'm building a PHP/MySQL website and I'm currently working on my database design. I do have some database and MySQL experience, but I've never structured a database from scratch for a real world application which hopefully is going to get some good traffic, so I'd love to hear advices from people who've already done it, in order to avoid common mistakes. I hope my explanations are not too confusing.
What I need
In my application, the user should be able to write a post (title + text), then create an "object" (which can be anything, like a video, or a song, etc.) and attach it to the post. The site has a list of predefined object types the user can create, and I should be able to add new types in the future. The user should also have the ability to see the object's details in a dedicated page and add a comment to it - the same applies to posts.
What I tried
I created an objects table with these fields: oid, type, name and date. This table contains records for anything the user should be able to add comments to (i.e. posts and objects). Then I created a postmeta table which contains additional post data (such as text, author, last edit date, etc.), a videometa table for data about the "video" object (URL, description, etc.), and so on. A postobject table (pid,oid) links objects to posts. Additionally, there's a comments table which contains the comment text, the author and the ID of the object it refers to.
Since the list of object types is predefined and is probably not going to change (though I still need the ability to add a type easily at any time without changing the app's code structure or the database design), and it is relatively small, it's not a problem to create a "meta" table for each type and make a corresponding PHP class in my application to handle it.
Finally, a page on the site needs to show a list of all the posts including the objects attached to it, sorted by date. So I get all the records from the objects table with type "post" and join it with postmeta to get the post metadata. Then I query postobject to get all the objects attached to this post, and comments to get all the comments.
The questions
Does this make any sense? Is it any good to design a database in this way for a real world site? I need to join quite a few tables to get all the data I need, and the objects table is going to become huge since it contains almost every item (only the type, name and creation date, though) - this is to keep the database and the app code flexible, but does it work in the real world, or is it too expensive in the long term? Am I thinking about it in the wrong way with this kind of OOP approach?
More specifically: suppose I need to list all the posts, including their attached objects and metadata. I would need to join these tables, at least: posts, postmeta, postobject and {$objecttype}meta (not to mention an users table to get all posts by a specific user, for example). Would I get poor performance doing this, even if I'm using only numeric indexes?
Also, I considered using a NoSQL database (MongoDB) for this project (thanks to Stuart Ellis advice). Apparently it seems much more suitable since I need some flexibility here. But my doubt is: metadata for my objects includes a lot of references to other records in the database. So how would I avoid data duplication if I can't use JOIN? Should I use DBRef and the techniques described here? How do they compare to MySQL JOINs used in the structure described above in terms of performance?
I hope these questions do make any sense. This is my first project of this kind and I just want to avoid to make huge mistakes before I launch it and find out I need to rework the design completely.
I'm not a NoSQL person, but I wonder whether this particular case might actually be handled best with a document database (MongoDB or CouchDB). Various type of objects with metadata attached sounds like the kind of scenario that MongoDB is designed for.
FWIW, you've got a couple of issues with your table and field naming that might bite you later. For example, type and date are rather generic, and also reserved words. You've also mixed singular and plural table names, which will throw any automatic object mapping.
Whichever database you use, it's a good idea to find an existing set of database naming conventions and apply it from the start - this will help you avoid subtle issues and ensure that your naming stays consistent. I tend to use the Rails naming conventions ATM, because they are well-known and fairly sensible.
Or you could store the object contents as a file, outside of the database, if you're concerned about the database space.
If you store anything in the database, you already have the object type in objects; so you could just add object_contents table with a long binary field to store the object. You don't need to create a new table for each new type.
I've seen a lot of JOIN's in real world web application (5 to 10). Objects table may get large, but that's indices are for. So far, I don't see anything wrong in your database. BTW, what felt strange to me - one post, one object, and separate comments for each? No ability to mix pictures with text?
Related
This is a question that has probably been asked before, but I'm having some difficulty to find exactly my case, so I'll explain my situation in search for some feedback:
I have an application that will be registering locations, I have several types of locations, each location type has a different set of attributes, but I need to associate notes to locations regardless of their type and also other types of content (mostly multimedia entries and comments) to said notes. With this in mind, I came up with a couple of solutions:
Create a table for each location type, and a "notes" table for every location table with a foreign key, this is pretty troublesome because I would have to create a multimedia and comments table for every comments table, e.g.:
LocationTypeA
ID
Attr1
Attr2
LocationTypeA_Notes
ID
Attr1
...
LocationTypeA_fk
LocationTypeA_Notes_Multimedia
ID
Attr1
...
LocationTypeA_Notes_fk
And so on, this would be quite annoying to do, but after it's done, developing on this structure should not be so troublesome.
Create a table with a unique identifier for the location and point content there, like so:
Location
ID
LocationTypeA
ID
Attr1
Attr2
Location_fk
Notes
ID
Attr1
...
Location_fk
Multimedia
ID
Attr1
...
Notes_fk
As you see, this is far more simple and also easier to develop, but I just don't like the looks of that table with only IDs (yeah, that's truly the only objection I have to this, it's the option I like the most, to be honest).
Similar to option 2, but I would have an enormous table of attributes shaped like this:
Location
ID
Type
Attribute
Name
Value
And so on, or a table for each attribute; a la Drupal. This would be a pain to develop because then it would take several insert/update operations to do something on a location and the Attribute table would be several times bigger than the location table (or end up with an enormous amount of attribute tables); it also has the same issue of the surrogate-keys-only table (just it has a "type" now, which I would use to define the behavior of the location programmatically), but it's a pretty solution.
So, to the question: which would be a better solution performance and scalability-wise?, which would you go with or which alternatives would you propose? I don't have a problem implementing any of these, options 2 and 3 would be an interesting development, I've never done something like that, but I don't want to go with an option that will collapse on itself when the content grows a bit; you're probably thinking "why not just use Drupal if you know it works like you expect it to?", and I'm thinking "you obviously don't know how difficult it is to use Drupal, either that or you're an expert, which I'm most definitely not".
Also, now that I've written all of this, do you think option 2 is a good idea overall?, do you know of a better way to group entities / simulate inheritance? (please, don't say "just use inheritance!", I'm restricted to using MySQL).
Thanks for your feedback, I'm sorry if I wrote too much and meant too little.
ORM systems usually use the following, mostly the same solutions as you listed there:
One table per hierarchy
Pros:
Simple approach.
Easy to add new classes, you just need to add new columns for the additional data.
Supports polymorphism by simply changing the type of the row.
Data access is fast because the data is in one table.
Ad-hoc reporting is very easy because all of the data is found in one table.
Cons:
Coupling within the class hierarchy is increased because all classes are directly coupled to the same table.
A change in one class can affect the table which can then affect the other classes in the hierarchy.
Space potentially wasted in the database.
Indicating the type becomes complex when significant overlap between types exists.
Table can grow quickly for large hierarchies.
When to use:
This is a good strategy for simple and/or shallow class hierarchies where there is little or no overlap between the types within the hierarchy.
One table per concrete class
Pros:
Easy to do ad-hoc reporting as all the data you need about a single class is stored in only one table.
Good performance to access a single object’s data.
Cons:
When you modify a class you need to modify its table and the table of any of its subclasses. For example if you were to add height and weight to the Person class you would need to add columns to the Customer, Employee, and Executive tables.
Whenever an object changes its role, perhaps you hire one of your customers, you need to copy the data into the appropriate table and assign it a new POID value (or perhaps you could reuse the existing POID value).
It is difficult to support multiple roles and still maintain data integrity. For example, where would you store the name of someone who is both a customer and an employee?
When to use:
When changing types and/or overlap between types is rare.
One table per class
Pros:
Easy to understand because of the one-to-one mapping.
Supports polymorphism very well as you merely have records in the appropriate tables for each type.
Very easy to modify superclasses and add new subclasses as you merely need to modify/add one table.
Data size grows in direct proportion to growth in the number of objects.
Cons:
There are many tables in the database, one for every class (plus tables to maintain relationships).
Potentially takes longer to read and write data using this technique because you need to access multiple tables. This problem can be alleviated if you organize your database intelligently by putting each table within a class hierarchy on different physical disk-drive platters (this assumes that the disk-drive heads all operate independently).
Ad-hoc reporting on your database is difficult, unless you add views to simulate the desired tables.
When to use:
When there is significant overlap between types or when changing types is common.
Generic Schema
Pros:
Works very well when database access is encapsulated by a robust persistence framework.
It can be extended to provide meta data to support a wide range of mappings, including relationship mappings. In short, it is the start at a mapping meta data engine.
It is incredibly flexible, enabling you to quickly change the way that you store objects because you merely need to update the meta data stored in the Class, Inheritance, Attribute, and AttributeType tables accordingly.
Cons:
Very advanced technique that can be difficult to implement at first.
It only works for small amounts of data because you need to access many database rows to build a single object.
You will likely want to build a small administration application to maintain the meta data.
Reporting against this data can be very difficult due to the need to access several rows to obtain the data for a single object.
When to use:
For complex applications that work with small amounts of data, or for applications where you data access isn’t very common or you can pre-load data into caches.
I am working on a project which involves building a social network-style application allowing users to share inventory/product information within their network (for sourcing).
I am a decent programmer, but I am admittedly not an expert with databases; even more so when it comes to database design. Currently, user/company information is stored via a relational database scheme in MySQL which is working perfectly.
My problem is that while my relational scheme works brilliantly for user/company information, it is confusing me on how to implement inventory information. The issue is that each "inventory list" will definitely contain differing attributes specific to the product type, but identical to the attributes of each other product in the list. My first thought was to create a table for each "inventory list". However, I feel like this would be very messy and would complicate future attempts at KDD. I also (briefly) considered using a 'master inventory' and storing the information (e.g. the variable categories and data as a JSON string. But I figured JSON strings MySQL would just become a larger pain in the ass.
My question is essentially how would someone else solve this problem? Or, more generally, sticking with principles of relational database management, what is the "correct" way to associate unique, large data sets of similar type with a parent user? The thing is, I know I could easily jerry-build something that would work, but I am genuinely interested in what the consensus is on how to solve this problem.
Thanks!
I would check out this post: Entity Attribute Value Database vs. strict Relational Model Ecommerce
The way I've always seen this done is to make a base table for inventory that stores universally common fields. A product id, a product name, etc.
Then you have another table that has dynamic attributes. A very popular example of this is Wordpress. If you look at their data model, they use this idea heavily.
One of the good things about this approach is that it's flexible. One of the major negatives is that it's slow and can produce complex code.
I'll throw out an alternative of using a document database. In that case, each document can have a different schema/structure and you can still run queries against them.
This is a complex problem, so I'm going to try to simplify it.
I have a mysql instance on my server hosting a number of schemas for different purposes. The schemas are structured generally (not perfectly) in a EAV fashion. I need to transition information into and out of that structure on a regular basis.
Example1: in order to present the information on a webpage, I get the information, stick it into a complex object, which I then pass via json to the webpage, where I convert the json into a complex javascript object, which I then present with knockoutjs and similar things.
Conclusion: This resulted in a lot of logic being put into multiple places so that I could associate the values on the page with the values in the database.
Example2: in order to allow users to import information from a pdf, I have a lot of information stored in pdf form fields. In this case, I didn't write the pdf though, so the form fields aren't named in such a way that all of this logic is easy enough to write 3 or more times for CRUD.
Conclusion: This resulted in my copying a list of the pdf form fields to a table in the database, so that I could then somehow associate them with where their data should be placed. The problem that arose is that the fields on the pdf would need to associate with schema.table.column and the only way I found to store that information was via a VARCHAR
Neither of the examples are referring to a small amount of data (something like 6 tables in example 1 and somewhere around 1400 pdf form fields in example 2). Given Example1 and the resulting logic being stored multiple places, it seemed logical to build Example2, where I could store the relationships between data in the database where they could be accessed and changed consistently and for all involved methods.
Now, it's quite possible I'm just being stupid and all of my googling hasn't come across that there's an easy way to associate this data with the correct schema.table.column If this is the case, then telling me the right way to do that is the simple answer here.
However, and this is where I get confused. I have always been told that you never want to store information about a database in the database, especially not as strings (varchar). This seems wrong on so many levels and I just can't figure out if I'm being stupid, and it's better to follow Example1 or if there's some trick here that I've missed about database structure.
Not sure where you got "... never ... store information about a database in the database". With an EAV model it is normal to store the metamodel (the entity types and their allowable attributes) in the database itself so that it is self-describing. If you had to change the metamodel, would you rather change code or a few rows in a table?
The main drawback to EAV databases is that you lose the ability to do simple joins. Join-type operations become much more complex. Like everything else in life, you make tradeoffs depending on your requirements. I have seen self-describing EAV architectures used very successfully.
I am pretty excited about the new Mysql XMl Functions.
Now I can finally embed something like "object oriented" documents in my oldschool relational database.
For an example use-case consider a user who sings up at your website using facebook connect.
You can fetch an object for the user using the graph api, and get nice information. This information however can vary vastly. Some fields may or may not be set, some may be added over time and so on.
Well if you are just intersted in very special fields (for example friends relations, gender, movies...), you can project them into your relational database scheme.
However using the XMl functions you could store the whole object inside a field and then your different models can access the data using the ExtractValue function. You can store everything right away without needing to worry what you will need later.
But what will the performance be?
For example I have a table with 50 000 entries which represent useres.
I have an enum field that states "male", "female" (or various other genders to be politically correct).
The performance of for example fetching all males will be very fast.
But what about something like WHERE ExtractValue(userdata, '/gender/') = 'male' ?
How will the performance vary if the object gets bigger?
Can I maby somehow put an Index on specified xpath selections?
How do field types work together with this functions/performance. Varchar/blob?
Do I need fulltext indexes?
To sum up my question:
Mysql XML functins look great. And I am sure they are really great if you just want to store structured data that you fetch and analyze further in your application.
But how will they stand battle in procedures where there are internal scans/sorting/comparision/calculations performed on them?
Can Mysql replace document oriented databases like CouchDB/Sesame?
What are the gains and trade offs of XML functions?
How and why are they better/worse than a dynamic application that stores various data as attributes?
For example a key/value table with an xpath as key and the value as value connected to the document entity.
Anyone made any other experiences with it or has noticed something mentionable?
I tend to make comments similar to Pekka's, but I think the reason we cannot laugh this off is your statement "This information however can vary vastly." That means it is not realistic to plan to parse it all and project it into the database.
I cannot answer all of your questions, but I can answer some of them.
Most notably I cannot tell you about performance on MySQL. I have seen it in SQL Server, tested it, and found that SQL Server performs in memory XML extractions very slowly, to me it seemed as if it were reading from disk, but that is a bit of an exaggeration. Others may dispute this, but that is what I found.
"Can Mysql replace document oriented databases like CouchDB/Sesame?" This question is a bit over-broad but in your case using MySQL lets you keep ACID compliance for these XML chunks, assuming you are using InnoDB, which cannot be said automatically for some of those document oriented databases.
"How and why are they better/worse than a dynamic application that stores various data as attributes?" I think this is really a matter of style. You are given XML chunks that are (presumably) documented and MySQL can navigate them. If you just keep them as-such you save a step. What would be gained by converting them to something else?
The MySQL docs suggest that the XML file will go into a clob field. Performance may suffer on larger docs. Perhaps then you will identify sub-documents that you want to regularly break out and put into a child table.
Along these same lines, if there are particular sub-docs you know you will want to know about, you can make a child table, "HasDocs", do a little pre-processing, and populate it with names of sub-docs with their counts. This would make for faster statistical analysis and also make it faster to find docs that have certain sub-docs.
Wish I could say more, hope this helps.
I'm refactoring a horribly interwoven db schema, it's not that it's overly normalised; just grown ugly over time and not terribly well laid out.
There are several tables (forum boards, forum posts, idea posts, blog entries) that share virtually identical data structures and composition, but are seperated simply because they represent different "objects" from the applications perspective. My initial reaction is to put everything that has the same data structure into the same table, and use a "type" column to distinguish data when performing a select.
Am I setting myself up for a fall by adopting this "all into one" approach and allowing (potentially) so many parts of the application to access the same table? FYI, I can't see this database growing to more than ~20mb over the next year or so...
There's basically three ways to store an object inheritance hierarchy in a relational database. Each has their own pros and cons. See:
http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
http://www.martinfowler.com/eaaCatalog/classTableInheritance.html
http://www.martinfowler.com/eaaCatalog/concreteTableInheritance.html
The book is great too. Luck would have it that chapter 3 - "Mapping to Relational Databases" - is available freely as a sample chapter. You can read more about the tradeoffs in there.
I used to dislike this "all into one" approach, but after I was forced to use it on a complex project a few years ago, I became a fan. If you index the table correctly, performance should be OK. You'll want an index on the type column to speed up your sort by type operations, for instance.
I now usually recommend that you use a single table to store similar objects. The only question then, is, do you want to use subtables to store data that's specific to a certain type of object? The answer to this question really depends on how different the structure of each object type is, and how many object types you'll have. If you have 50 object types with vastly differing structures, you may want to consider storing just the consistent object parts in the main table and creating a sub table for each object type.
In your example, however, I think you'd be fine just putting it all into a single table.
For more info, see here: http://www.agiledata.org/essays/mappingObjects.html
Don't lean too much on the "applications perspective", it tends to vary over time anway. Often databases are accessed by different applications too, and it usually outlives them all ...
When simliar objects are stored in different tables the reason may be that they actually represent the same domain object, but in a different state, or in a different step in a workflow. Then it often makes sense to keep them in one table and add some simple attributes to flag the state. If the workflow, or whatever it is changes, it's easier to change the database and application too, you may not need to add more tables or classes.