Chicken and Egg: Database design


I have these tables. I bet that by looking at them you will already know my problem :)
content_table
--------------------------------------
| id | title | type | parent_id |
--------------------------------------
| 0 | Root | Page | 0 |
|100 | Home | Page | 1 |
|101 | Main Text |Section| 1 |
|102 | About | Page | 1 |
|301 | Foo | Text | 245 |
|302 | About Us | Text | 246 |
--------------------------------------
paging_table
---------------------------------
| page_id | section_id | rel_id |
---------------------------------
| 0 | 0 | 1 |
| 100 | 101 | 245 |
| 102 | 101 | 246 |
---------------------------------
section_options
----------------------------
| section_id | option_mask |
----------------------------
| 101 | 65535 |
----------------------------
*paging_table.page_id and paging_table.section_id
both have FOREIGN KEYs on content_table.id
section_options.section_id has a FOREIGN KEY on content_table.id
So basically I have a CMS and I want to treat EVERYTHING as content, be it a page, a page section, or the actual contents of the pages themselves.
Secondly, since some page sections will be quite similar, I decided that I need not create multiple sections (e.g. home_main_text, about_main_text, etc...). I just need to create a generic section and have the paging_table take care of the rest since sections will also have a whole lot of display options with them (stored in another table that has a reference to content_table.id). If I am to have similar sections with very similar options stored in two rows, that would look bad wouldn't it?
Then I created a root content row (the one with id = 0 in the content_table). All main pages and sections will have the root as their parent.
My problem now is that I want to put a FOREIGN KEY on parent_id that references the rel_id column. But I have the Root element to worry about. I already feel like I am doing a hack on the first row of the paging_table, and the root content now looks like a chicken and egg scenario. Do you think the root content is really necessary? How about the generic section approach? I just want a better design for this database :), or maybe an overall redesign of the architecture of the CMS, since I'm just starting and I really haven't done much yet.
Criticisms are very much welcome (just be constructive). If anything is vague, please comment and I will try to clear it up. I am having a hard time articulating what I have in mind, and it would really be a hassle if I simply sent you the source code of the classes that I am building. Thanks!
EDIT
I've edited the id's to make the references clear

I don't really see a problem there. I would just set the parent_id of Root to NULL: it has no parent, and it is NOT its own parent.
Otherwise, SQL Server (and probably some other RDBMS) has hierarchical capabilities.
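A minimal sketch of the NULL-parent idea, using Python's built-in sqlite3 module as a stand-in for MySQL. The table is a simplified, hypothetical version of content_table in which parent_id points straight at content.id rather than at rel_id, and the ids are illustrative. MySQL 8.0 and later also accept WITH RECURSIVE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("""
    CREATE TABLE content (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        type      TEXT NOT NULL,
        parent_id INTEGER REFERENCES content(id)  -- NULL means "no parent"
    )
""")
# Assumption: simplified rows; the root is the only row without a parent.
conn.executemany(
    "INSERT INTO content (id, title, type, parent_id) VALUES (?, ?, ?, ?)",
    [
        (1, "Root", "Page", None),
        (100, "Home", "Page", 1),
        (102, "About", "Page", 1),
        (301, "Foo", "Text", 100),
    ],
)

# Walk the tree from the root down with a recursive common table expression.
rows = conn.execute("""
    WITH RECURSIVE tree(id, title, depth) AS (
        SELECT id, title, 0 FROM content WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.title, t.depth + 1
        FROM content AS c JOIN tree AS t ON c.parent_id = t.id
    )
    SELECT id, title, depth FROM tree ORDER BY depth, id
""").fetchall()

for row in rows:
    print(row)
```

Because the root's parent_id is NULL, the foreign key has nothing to check for that row, so no self-referencing hack is needed.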

Let me be blasphemous: relational databases are not well suited to this kind of task, and building hierarchies with relations is painful. I made the same mistake once and would never do it again. I created a small, lightweight CMS with just the file system as storage, plus XML documents. Other concepts such as versioning, replication, and workflow are easy to add with (surprise!) a source versioning system like git or svn.
Another option would be a document-oriented database like MongoDB (there are others, but I'm most familiar with Mongo right now): no schema, easy hierarchies, scales out well. What else do you need? (And there is a PHP driver.)
To hell with normalized data ;)

Your Section points to a Content record now, this is good.
However, you need to get rid of the awkward paging_table:
Each Section may point to a Page and has an integer describing the "order" in that parent relationship.
If a Section does not point to a Page, it points to another Section, you can reuse the "order" field.
So you have parent_page and parent_section fields, one of which may be NULL. If you're crazy about normalizing you'll need more Section tables, but you may need more than you think.
Note that you will lose hierarchic information in your content_table, but this is OK since there is nothing generally hierarchic about all "content". Only sections are hierarchic.
An even simpler way would be to see a Page as just a type of Section that does not have a parent Section. But I don't know enough of the other data that may be involved in pages. In a regular Wiki I would use that, however.
EDIT:
If you really need to "reuse" the actual Section records, you need a SectionAssignment table that allows a m-n relationship between Sections and Pages. SectionAssignment will have four fields: assignment_id, section_id, page_id, and order.
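The SectionAssignment idea could look roughly like the sketch below, again with Python's sqlite3 standing in for MySQL. The page and section tables and all column types are assumptions made for illustration; only the four SectionAssignment fields come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical parent tables, reduced to the bare minimum.
    CREATE TABLE page    (page_id    INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE section (section_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- The m-n link: one row per (section, page) placement.
    CREATE TABLE section_assignment (
        assignment_id INTEGER PRIMARY KEY,
        section_id    INTEGER NOT NULL REFERENCES section(section_id),
        page_id       INTEGER NOT NULL REFERENCES page(page_id),
        "order"       INTEGER NOT NULL  -- quoted because ORDER is a reserved word
    );

    INSERT INTO page VALUES (100, 'Home'), (102, 'About');
    INSERT INTO section VALUES (101, 'Main Text');
    INSERT INTO section_assignment VALUES (245, 101, 100, 1), (246, 101, 102, 1);
""")

# Which pages reuse the generic "Main Text" section?
pages = conn.execute("""
    SELECT p.title
    FROM section_assignment AS a
    JOIN page AS p ON p.page_id = a.page_id
    WHERE a.section_id = 101
    ORDER BY p.page_id
""").fetchall()
print(pages)
```

Content rows could then reference assignment_id, which plays the role that rel_id played in the original paging_table.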

Related

How to index database?

This is killing me: everybody says what it is, but no one points to a guide or teaches the basics.
Is it something that is better done from the start, or can you add indexes just as easily later, once your loading times are getting longer?
Has anyone found a good starting point for someone who's not a pro in databases? (I mean a starting point for indexing; don't worry, I know the basics of databases.) Main rules, good practice, etc.
I'm not here to ask you to write a huge tutorial, but if you're really, really bored, go ahead. :)
I'm using WordPress, if that's important to know. Yes, I know that WP uses very basic indexing, but if it's something good to start with from the beginning, I can't see a reason not to.
It's barely related, but I also didn't find an answer to this online. I can guess the answer, but I'm not 100% sure: what's the more efficient way to store data with the same key, in an array or in separate rows (separate ids but the same key)? There's usually a maximum of 20 items per post, and the number of posts could be in the thousands in the future. Which would be the better solution?
Different rows, ids & values BUT same key
id | key |values|
--------------------
25 | Bob | 3455 |
--------------------
24 | Bob | 1654 |
--------------------
23 | Bob | 8432 |
Same row, id & key BUT value is serialized array
id | key | values |
------------------------------
23 | Bob | serialized array |
------------------------------

If you want a quick rule of thumb, index any columns in a table that you will be using to lookup rows. For example, I may have a table as follows:
id| Name| date |
--------------------
0 | Bob | 11.12.16 |
--------------------
1 | John| 15.12.16 |
--------------------
2 | Tim | 19.12.16 |
So obviously your ID is your primary index, but let's say you have a page that will SORT the whole table by DATE. In that case you would add an index on date.
Basically, indexes make it a lot faster for the engine to find specific records or order them by a specific column. They do a lot more, but when I am designing sites for myself or little tools for the office at work, I usually just go by that.
Large corporate databases can have thousands of indexes and even more relations between tables, but for us small peasant folk, what I said should usually be enough.
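To see the rule of thumb in action, here is a small sketch that uses Python's sqlite3 module in place of MySQL and asks the engine for its query plan before and after adding an index. The table name people, the index name idx_people_date, and the ISO date strings (chosen so that text order matches date order) are assumptions. In MySQL the equivalent check is EXPLAIN followed by the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO people (id, name, date) VALUES (?, ?, ?)",
    [(0, "Bob", "2016-12-11"), (1, "John", "2016-12-15"), (2, "Tim", "2016-12-19")],
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# A lookup on the date column, the kind of query the index is meant to help.
query = "SELECT id, name FROM people WHERE date >= '2016-12-15'"

before = plan(query)  # no index on date yet: the whole table is scanned
conn.execute("CREATE INDEX idx_people_date ON people (date)")
after = plan(query)   # the planner can now search the index instead

print("before:", before)
print("after: ", after)
```

The same index can also help the engine return rows already ordered by date, although whether it is used for a given query is up to the planner, so it is worth checking the plan rather than assuming.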

You're asking a really complicated question. But the tl;dr: a database index is a data structure that improves the speed of data retrieval operations on a database table, at the cost of additional writes and storage space to maintain the index data structure.
More detailed info is already provided in the thorough answer here:
How does database indexing work?

Whether to merge avatar and profile tables?

I have two tables:
Avatars:
Id | UserId | Name | Size
-----------------------------------------------
1 | 2 | 124.png | Large
2 | 2 | 124_thumb.png | Thumb
Profiles:
Id | UserId | Location | Website
-----------------------------------------------
1 | 2 | Dallas, Tx | www.example.com
These tables could be merged into something like:
User Meta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | location | Dallas, Tx
2 | 2 | website | www.example.com
3 | 2 | avatar_lrg | 124.png
4 | 2 | avatar_thmb | 124_thumb.png
This to me could be a cleaner, more flexible setup (at least at first glance). For instance, if I need to allow a "user status message", I can do so without touching the database.
However, the user's avatars will be pulled far more than their profile information.
So I guess my real questions are:
What kind of performance hit would this produce?
Is merging these tables just a really bad idea?

This is almost always a bad idea. What you are doing is a form of the Entity Attribute Value model. This model is sometimes necessary when a system needs a flexible attribute system to allow the addition of attributes (and values) in production.
This type of model is essentially built on metadata in lieu of real relational data. This can lead to referential integrity issues, orphan data, and poor performance (depending on the amount of data in question).
As a general matter, if your attributes are known up front, you want to define them as real data (i.e. actual columns with actual types) as opposed to string-based metadata.

In this case, it looks like users may have one large avatar and one small avatar, so why not make those columns on the user table?
We have a similar type of table at work that probably started with good intentions, but is now quite the headache to deal with. This is because it now has 100s of different "MetaKeys", and there is no good documentation about what is allowed and what each does. You basically have to look at how each is used in the code and figure it out from there. Thus, figure out how you will document this for future developers before you go down that route.
Also, to retrieve all the information about each user it is no longer a 1-row query, but an n-row query (where n is the number of fields on the user). Also, once you have that data, you have to post-process each of those based on your meta-key to get the details about your user (which usually turns out to be more of a development effort because you have to do a bunch of String comparisons). Next, many databases only allow a certain number of rows to be returned from a query, and thus the number of users you can retrieve at once is divided by n. Last, ordering users based on information stored this way will be much more complicated and expensive.
In general, I would say that you should make any fields that have specialized functionality or require ordering to be columns in your table. Since they will require a development effort anyway, you might as well add them as an extra column when you implement them. I would say your avatar pics fall into this category, because you'll probably have one of each, and will always want to display the large one in certain places and the small one in others. However, if you wanted to allow users to make their own fields, this would be a good way to do this, though I would make it another table that can be joined to from the user table. Below are the tables I'd suggest. I assume that "Status" and "Favorite Color" are custom fields entered by user 2:
User:
| Id | Name |Location | Website | avatarLarge | avatarSmall
----------------------------------------------------------------------
| 2 | iPityDaFu |Dallas, Tx | www.example.com | 124.png | 124_thumb.png
UserMeta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | Status | Hungry
2 | 2 | Favorite Color | Blue
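The two suggested tables could be sketched as follows, with Python's sqlite3 standing in for MySQL. The snake_case column names and the column types are assumptions; the rows are the ones from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Known, fixed attributes live in real columns.
    CREATE TABLE user (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        location     TEXT,
        website      TEXT,
        avatar_large TEXT,
        avatar_small TEXT
    );
    -- Only user-defined fields go into the key-value table.
    CREATE TABLE user_meta (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES user(id),
        meta_key   TEXT NOT NULL,
        meta_value TEXT
    );
    INSERT INTO user VALUES
        (2, 'iPityDaFu', 'Dallas, Tx', 'www.example.com', '124.png', '124_thumb.png');
    INSERT INTO user_meta VALUES
        (1, 2, 'Status', 'Hungry'),
        (2, 2, 'Favorite Color', 'Blue');
""")

# Known fields come back as a single row ...
profile = conn.execute(
    "SELECT name, avatar_large, avatar_small FROM user WHERE id = ?", (2,)
).fetchone()

# ... while custom fields are fetched separately, only when they are needed.
custom = dict(conn.execute(
    "SELECT meta_key, meta_value FROM user_meta WHERE user_id = ? ORDER BY id", (2,)
))

print(profile)
print(custom)
```

The frequently read avatar columns never touch the key-value table, which keeps the common query a one-row lookup.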

I'd stick with the original layout. Here are the downsides of replacing your existing table structure with a big table of key-value pairs that jump out at me:
Inefficient storage - since the data stored in the metavalue column is mixed, the column must be declared with the worst-case data type, even if all you would need to hold is a boolean for some keys.
Inefficient searching - should you ever need to do a lookup from the value in the future, the mishmash of data will make indexing a nightmare.
Inefficient reading - reading a single user record now means doing an index scan for multiple rows, instead of pulling a single row.
Inefficient writing - writing out a single user record is now a multi-row process.
Contention - having mixed your user data and avatar data together, you've forced threads that only care about one or the other to operate on the same table, increasing your risk of running into locking problems.
Lack of enforcement - your data constraints have now moved into the business layer. The database can no longer ensure that all users have all the attributes they should, or that those attributes are of the right type, etc.
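The enforcement point can be illustrated with a short sketch (Python's sqlite3 in place of MySQL). The CHECK constraint on size and the misspelled values are assumptions added to make the contrast visible; they are not part of the original tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE avatars (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        name    TEXT NOT NULL,
        -- Assumption: only these two sizes are valid.
        size    TEXT NOT NULL CHECK (size IN ('Large', 'Thumb'))
    );
    CREATE TABLE user_meta (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        meta_key   TEXT NOT NULL,
        meta_value TEXT
    );
""")

# The typed table refuses a size it does not know about.
try:
    conn.execute(
        "INSERT INTO avatars (user_id, name, size) VALUES (2, '124.png', 'Hugee')"
    )
    typed_rejected = False
except sqlite3.IntegrityError:
    typed_rejected = True

# The key-value table accepts a misspelled key without complaint.
conn.execute(
    "INSERT INTO user_meta (user_id, meta_key, meta_value) "
    "VALUES (2, 'avatar_lgr', '124.png')"
)
meta_rows = conn.execute("SELECT COUNT(*) FROM user_meta").fetchone()[0]

print("typed insert rejected:", typed_rejected)
print("rows in user_meta:", meta_rows)
```

With the key-value layout, catching the misspelled key becomes the application's job.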

What is the best way to handle these MySQL database relationships?

I'm building a small website that lets users recommend their favourite books to each other. So I have two tables, books and groups. A user can have 0 or more books in their library, and a book belongs to 1 or more groups. Currently, my tables look like this:
books table
|---------|------------|---------------|
| book_id | book_title | book_owner_id |
|---------|------------|---------------|
| 22 | something | 12 |
|---------|------------|---------------|
| 23 | something2 | 12 |
|---------|------------|---------------|
groups table
|----------|------------|---------------|---------|
| group_id | group_name | book_owner_id | book_id |
|----------|------------|---------------|---------|
| 231 | random | 12 | 22 |
|----------|------------|---------------|---------|
| 231 | random | 12 | 23 |
|----------|------------|---------------|---------|
As you can see, the relationships between users+books and books+groups are defined in the tables themselves. Should I define the relationships in their own tables instead? Something like this:
books table
|---------|------------|
| book_id | book_title |
|---------|------------|
| 22 | something |
|---------|------------|
| 23 | something2 |
|---------|------------|
books_users_relationsship table
|---------|------------|---------|
| rel_id | user_id | book_id |
|---------|------------|---------|
| 1 | 12 | 22 |
|---------|------------|---------|
| 2 | 12 | 23 |
|---------|------------|---------|
groups table
|----------|------------|
| group_id | group_name |
|----------|------------|
| 231 | random |
|----------|------------|
groups_books_relationsship table
|----------|---------|
| group_id | book_id |
|----------|---------|
| 231 | 22 |
|----------|---------|
| 231 | 23 |
|----------|---------|
Thanks for your time.

The second form with four tables is the correct one. You could drop rel_id from books_users_relationsship, since the primary key can be a composite of user_id and book_id, just as in the groups_books_relationsship table.
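A minimal sketch of the composite-key suggestion, using Python's sqlite3 in place of MySQL. The shorter table name books_users is hypothetical, and user_id carries no foreign key here only because the question does not show a users table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_title TEXT NOT NULL);

    -- No rel_id: the pair (user_id, book_id) is the primary key.
    CREATE TABLE books_users (
        user_id INTEGER NOT NULL,
        book_id INTEGER NOT NULL REFERENCES books(book_id),
        PRIMARY KEY (user_id, book_id)
    );

    INSERT INTO books VALUES (22, 'something'), (23, 'something2');
    INSERT INTO books_users VALUES (12, 22), (12, 23);
""")

# The composite key also stops the same book being linked to a user twice.
try:
    conn.execute("INSERT INTO books_users VALUES (12, 22)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

owned = [row[0] for row in conn.execute(
    "SELECT book_id FROM books_users WHERE user_id = ? ORDER BY book_id", (12,)
)]

print("duplicate rejected:", duplicate_rejected)
print("books owned by user 12:", owned)
```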

You do not need a "relationship table" to support a relationship. In Databases, implementing a Foreign Key in a child table defines the Relation between the parent and the child. You need tables only if they contain data, or to resolve a many-to-many relationship (and that has no data other than the Primary Keys of the parents).
The second problem you are facing, the reason the Relations become complex, and even optional, is due to the first two tables not being Normalised. Many problems ensue from that.
if you look closely at book, you may notice that the same book (title) gets repeated
likewise, there is no differentiation between (a) a book in terms of its existence in the world and (b) a copy of a book, that is owned by a member, and available for borrowing
eg. the review is about an existing book, once, and applies to all copies of a book; not to an owned book.
your "relationship" tables also have data in them, and the data is repeated.
all this repeated data needs to be maintained and kept in synch.
all those problems are eliminated if the data is Normalised.
Therefore (since you are seeking the "best way"), the sequence is to normalise the data first, after which (no surprise) the Relations are easy and not complex, and no data is repeated (in either the tables or the relations).
when Normalising, it is best to model the real world (not the entire real world, but whatever parts of it that you are implementing in the database). That insulates your database from the effects of change, and functional extensions to it in future do not require the existing tables to be changed.
It is also important to use accurate names for tables and columns, for the same reason. group is non-specific and will cause a problem in future when you implement some other form of grouping.
The relations can be now defined at the correct "level", between the correct tables.
The need to stick an Id column on everything that moves severely hinders your ability to understand the data and thus the Normalisation process, and robs the database of Relational power.
Notice that the existing keys are already unique and meaningful, short and efficient; no additional surrogate keys (and their additional indexes) are required.
ReviewerId, OwnerId and BorrowerId are all MemberIds, used as Foreign Keys, showing the explicit Role in which they are used.
Note that your problem space is not as simple as you think, it is used as a case study and shipped with tutorials for SQL (eg. MS SQL, Sybase).
Social Library Data Model
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find IDEF1X Notation useful.
I have provided the structure required to support borrowing, to again illustrate how easy it is to implement Relations on Normalised data, and to show the correct tables upon which borrowing depends (it is not between any book and any person; only owned book can be borrowed).
These issues are very important because they define the Referential Integrity of the database.
It is also important to implement that in the database itself, which is the Standard location (rather than in app code all over the place). Declarative Referential Integrity is part of IEC/ISO/ANSI Standard SQL. And the question has a database design tag.
Referential Integrity cannot be defined or enforced in some databases that do not fully implement the SQL Standard (sometimes it can be defined but it is not enforced, which is confusing). Nevertheless, you can design and implement whatever parts of a database your particular database supports.
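The difference between a foreign key that is merely defined and one that is enforced can be shown with a small sketch. It uses Python's sqlite3 module, where enforcement is switched per connection through PRAGMA foreign_keys. The member and book_copy tables and their columns are hypothetical and much simpler than the data model described above.

```python
import sqlite3


def try_orphan(enforce):
    """Insert a book copy whose owner does not exist and report the outcome."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = " + ("ON" if enforce else "OFF"))
    conn.executescript("""
        CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE book_copy (
            copy_id  INTEGER PRIMARY KEY,
            title    TEXT NOT NULL,
            owner_id INTEGER NOT NULL REFERENCES member(member_id)
        );
        INSERT INTO member VALUES (12, 'Alice');
    """)
    try:
        # Member 99 does not exist, so this row would be an orphan.
        conn.execute("INSERT INTO book_copy VALUES (1, 'something', 99)")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"
    finally:
        conn.close()


without_enforcement = try_orphan(enforce=False)
with_enforcement = try_orphan(enforce=True)
print("FK defined but not enforced:", without_enforcement)
print("FK defined and enforced:    ", with_enforcement)
```

The schema is identical in both runs; only the enforcement setting decides whether the orphan row gets in.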

On a stats-system, should I save little bits of information about single visit on many tables or just one table?

I've been wondering this for a while already. The title stands for my question. What do you prefer?
I made a pic to make my question clearer.
Why am I even thinking of this? Isn't one table the most obvious option? Well, kind of. It's the simplest way, but let's be more practical. When there is a ton of data in one table and the user only wants to see statistics about the browsers the visitors use, this may not work as well. Reading browser data from its own table is naturally better.
Multiple tables have disadvantages too. Writing data takes more time and resources. With one table only one MySQL query is needed.
Anyway, I figured out a solution, which I think makes sense. Data is written to some kind of temporary table. All of those lines will be exported to multiple tables later (scheduled script). This way the system doesn't take loading-time from the users page, but the data remains fast to browse.
Let's bring some discussion here. I'm hoping to raise some opinions.
Which one is better? Let's find out!

The date, browser and OS are all related on a one-to-one basis... Without more information to require distinguishing records further, I'd be creating a single table rather than two.
Database design is based on creating tables that reflect entities, and I don't see two distinct entities in the example provided. Consider using views to serve data without duplicating the data in the database; a centralized copy of the data makes managing the data much easier...
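A sketch of the single-table-plus-view approach, with Python's sqlite3 standing in for MySQL. The table name visit, the view name browser_stats, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per visit; nothing is copied into secondary tables.
    CREATE TABLE visit (
        id         INTEGER PRIMARY KEY,
        visited_at INTEGER NOT NULL,
        browser    TEXT NOT NULL,
        os         TEXT NOT NULL
    );
    INSERT INTO visit (visited_at, browser, os) VALUES
        (127003727, 'Firefox', 'Ubuntu'),
        (127391662, 'Safari',  'Windows'),
        (127912683, 'Firefox', 'Windows');

    -- The view is a stored query, so it always reflects the current rows.
    CREATE VIEW browser_stats AS
        SELECT browser, COUNT(*) AS visits
        FROM visit
        GROUP BY browser;
""")

stats = conn.execute(
    "SELECT browser, visits FROM browser_stats ORDER BY browser"
).fetchall()
print(stats)
```

A similar view per report (operating systems, dates, and so on) keeps the reads simple while the data stays in one place.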

What you're really thinking of is whether to keep the table denormalized or to normalize it. When you normalize, you have a table that looks like this:
Table statistic
id | date | browser_id | os_id
---------------------------------------------
1 | 127003727 | 1 | 1
2 | 127391662 | 2 | 2
3 | 127912683 | 3 | 2
And then to explain what browser and os the client used, you need other tables:
Table browser
id | name | company | version
-----------------------------------------------
1 | Firefox | Mozilla | 3.6.8
2 | Safari | Apple | 4.0
3 | Firefox | Mozilla | 3.5.1
Table os
id | name | company | version
-----------------------------------------------
1 | Ubuntu | Canonical | 10.04
2 | Windows | Microsoft | 7
3 | Windows | Microsoft | 3.11
As OMG Ponies already pointed out, this isn't a good case for creating several entities, so you can safely go with one table and then think about how you are going to deal with having to, say, find all the entries with a matching browser name.
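For comparison, finding all entries with a matching browser name in the multi-table layout takes one join. The sketch below uses Python's sqlite3 in place of MySQL, with the rows from the tables above; the os table is left out for brevity and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE browser (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        company TEXT NOT NULL,
        version TEXT NOT NULL
    );
    CREATE TABLE statistic (
        id         INTEGER PRIMARY KEY,
        date       INTEGER NOT NULL,
        browser_id INTEGER NOT NULL REFERENCES browser(id),
        os_id      INTEGER NOT NULL  -- os table omitted in this sketch
    );
    INSERT INTO browser VALUES
        (1, 'Firefox', 'Mozilla', '3.6.8'),
        (2, 'Safari',  'Apple',   '4.0'),
        (3, 'Firefox', 'Mozilla', '3.5.1');
    INSERT INTO statistic VALUES
        (1, 127003727, 1, 1),
        (2, 127391662, 2, 2),
        (3, 127912683, 3, 2);
""")

# All visits made with any version of Firefox.
firefox_visits = [row[0] for row in conn.execute("""
    SELECT s.id
    FROM statistic AS s
    JOIN browser AS b ON b.id = s.browser_id
    WHERE b.name = ?
    ORDER BY s.id
""", ("Firefox",))]
print(firefox_visits)
```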

should i really use a relation table when tagging blog posts?

while trying to figure out how to tag a blog post with a single sql statement here, the following thought crossed my mind: using a relation table tag2post that references tags by id as follows just isn't necessary:
tags
+-------+-----------+
| tagid | tag |
+-------+-----------+
| 1 | news |
| 2 | top-story |
+-------+-----------+
tag2post
+----+--------+-------+
| id | postid | tagid |
+----+--------+-------+
| 0 | 322 | 1 |
+----+--------+-------+
why not just use the following model, where you index the tag itself as follows? given that tags are never renamed, only added and removed, this could make sense, right? what do you think?
tag2post
+----+--------+-------+
| id | postid | tag |
+----+--------+-------+
| 1 | 322 | sun |
+----+--------+-------+
| 2 | 322 | moon |
+----+--------+-------+
| 3 | 4443 | sun |
+----+--------+-------+
| 4 | 2567 | love |
+----+--------+-------+
PS: i keep an id in order to easily display the last n tags added...

It works, but it is not normalized, because you have redundancy in the tags. You also lose the ability to use the "same" tags to tag things besides posts. For small N, optimization doesn't matter, so I have no problems if you run with it.
As a practical matter, your indexes will be larger (assuming you are going to index on tag for searching, you are now indexing duplicates and indexing strings). In the normalized version, the index on the tags table will be smaller, will not have duplicates, and the index on the tag2post table on tagid will be smaller. In addition, the fixed size int columns are very efficient for indexing and you might also avoid some fragmentation depending on your clustering choices.
I know you said no renaming, but in general, in both cases, you might still need to think about the semantics of what it means to rename (or even delete) a tag - do all entries need to be changed, or does the tag get split in some way. Because this is a batch operation in a transaction in the worst case (all the tag2post have to be renamed), I don't really classify it as significant from a design point of view.

This sounds fine to me. Using an ID to reference something that you have moved into another table makes sense when you have things that vary, say a user's name, because you don't want to change the name in every place in your database when the user changes it. However, in this case the tag names themselves will not vary, so the only potential downside I see is that a text index might be slightly slower to search through than a numeric index.

Where is the real advantage of your proposal over a relation table containing IDs?
Technically they solve the same problem, but your proposed solution does it in a redundant, de-normalized way that only seems to satisfy the instinctive urge to be able to read the data directly from the relation table.
The DB server is pretty good at joining tables, and even more so if the join is over an INT field with an index on it. I don't think you will be facing devastating performance issues when you join another table (like: INT id, VARCHAR(50) TagName) to your query.
But you lose the ability to easily rename a tag (even if you don't plan on doing so), and you needlessly inflate your relation table with redundant data. Over time, this may cost you more performance than the normalized solution.
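A sketch of the normalized layout, with Python's sqlite3 standing in for MySQL. The rows mirror the question's example, and the UNIQUE constraint on tag is an assumption. It shows both the join and the fact that a rename touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE tags (
        tagid INTEGER PRIMARY KEY,
        tag   TEXT NOT NULL UNIQUE  -- assumption: each tag text is stored once
    );
    CREATE TABLE tag2post (
        id     INTEGER PRIMARY KEY,
        postid INTEGER NOT NULL,
        tagid  INTEGER NOT NULL REFERENCES tags(tagid)
    );
    INSERT INTO tags VALUES (1, 'sun'), (2, 'moon'), (3, 'love');
    INSERT INTO tag2post VALUES (1, 322, 1), (2, 322, 2), (3, 4443, 1), (4, 2567, 3);
""")

# Renaming a tag updates one row, however many posts carry it.
cur = conn.execute("UPDATE tags SET tag = ? WHERE tag = ?", ("sunshine", "sun"))
renamed_rows = cur.rowcount

# Posts carrying the renamed tag, found through an integer join.
posts = [row[0] for row in conn.execute("""
    SELECT tp.postid
    FROM tag2post AS tp
    JOIN tags AS t ON t.tagid = tp.tagid
    WHERE t.tag = ?
    ORDER BY tp.postid
""", ("sunshine",))]

print("rows changed by rename:", renamed_rows)
print("posts tagged sunshine:", posts)
```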

The de-normalised method may be fine depending on your application.
You may find that it causes a performance hit due to searching a large set of VARCHAR data.
When doing a search for things tagged like "sun*" (e.g. sun, sunny, sunrise) you will not need to do a join. However, you will need to do a LIKE comparison on a MUCH larger set of VARCHAR data. Proper indexing may alleviate this issue, but only testing will tell you which method is faster with your dataset.
You also have the option of adding a VIEW that pre-joins the normalised tables. This gives you simpler queries while still allowing you to have highly normalised data.
My recommendation is to go with a normalised structure (and add de-normalised views as necessary for ease of use) until you encounter an issue that de-normalising the data schema fixes.
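The view idea could be sketched like this, with Python's sqlite3 standing in for MySQL. The view name post_tags and the sample rows are hypothetical. The prefix search from the "sun*" example then runs against the view without the caller writing the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tagid INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
    CREATE TABLE tag2post (
        id     INTEGER PRIMARY KEY,
        postid INTEGER NOT NULL,
        tagid  INTEGER NOT NULL REFERENCES tags(tagid)
    );
    INSERT INTO tags VALUES (1, 'sun'), (2, 'sunny'), (3, 'moon');
    INSERT INTO tag2post (postid, tagid) VALUES (322, 1), (322, 3), (4443, 2);

    -- De-normalised, read-only face on top of the normalised tables.
    CREATE VIEW post_tags AS
        SELECT tp.postid, t.tag
        FROM tag2post AS tp
        JOIN tags AS t ON t.tagid = tp.tagid;
""")

# Prefix search written as if the tag text lived in the relation table.
matches = conn.execute(
    "SELECT postid, tag FROM post_tags WHERE tag LIKE ? ORDER BY postid, tag",
    ("sun%",),
).fetchall()
print(matches)
```

Here the LIKE comparison runs over the short list of distinct tags rather than over every row of the relation table.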

I was considering that too. Want a list of tags in the database, just select distinct tag from tag2post. I was told that since I wanted to optimize for select statements, it would be better to use an integer key because it was much faster than using a string.
