Combining and querying multiple tables efficiently (STI vs MTI) - mysql

In my Rails project, I have several kinds of content types, let's say "Article," "Interview," and "Review."
In planning my project, I decided to go with separate tables for all of these (multiple tables vs. a single table), because while these content types share some fields in common (title, teaser, etc.), they each also have fields unique to that content type.
I managed to get all of this working in my project, but I hit some roadblocks doing so. Namely, let's say I want to show all of these content types TOGETHER, ordered by publish date (like on a category page), and I just need to display the fields (title, teaser, etc.) that they all have in common...this is problematic.
I went with the has_many_polymorphs plugin for a little while, but it was never officially updated for Rails 3, and I was using a forked branch 'hacked' together for R3. Anyway, it caused problems and spat out weird error messages now and then, and after one of my updates broke the plugin, I decided it was a good idea to abandon it.
Then I looked for a plugin that could perform SQL like UNION joins, but that plugin hasn't been updated since 2009 so I passed that by.
What I ended up doing was making three different queries, one per table, then using Ruby to combine the results (@articles + @reviews, etc.). Not efficient, but I need it to work (and these areas will likely be cached eventually).
However, it 'would' be easier if I could just query a single table, and there is still time to restructure my schema to roll out with the most efficient approach. So I'm now wondering if STI is indeed the way to go, and if so, how I would handle the columns that differ between types.
UPDATE
I've been thinking heavily about this...namely how I can use STI but still have some unique fields for each type. Let's say I have a 'Content' table with my shared columns like title, teaser, etc....then separate tables like Article and Review holding the unique columns. Couldn't I technically make those nested attributes of the Content table, or rather of the Content table's specific types?
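That layout (shared columns in one table, type-specific columns in satellite tables) is usually called class table inheritance, i.e. the MTI side of the question. A minimal sketch of it in MySQL, with assumed table and column names:

    CREATE TABLE contents (
      id INT AUTO_INCREMENT PRIMARY KEY,
      `type` VARCHAR(32) NOT NULL,          -- 'Article', 'Review', ...
      title VARCHAR(255),
      teaser TEXT,
      published_at DATETIME
    );

    CREATE TABLE articles (
      content_id INT PRIMARY KEY,           -- FK to contents.id
      body MEDIUMTEXT                       -- article-only column
    );

    CREATE TABLE reviews (
      content_id INT PRIMARY KEY,           -- FK to contents.id
      rating INT                            -- review-only column
    );

    -- The combined "all types by publish date" listing then touches only
    -- the shared table:
    SELECT id, `type`, title, teaser, published_at
    FROM contents
    ORDER BY published_at DESC;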
UPDATE 2
So apparently, according to what I've researched, STI is a bad idea if you need different columns for the different types, or if you need different controllers for the different content types. So maybe sticking with multiple tables and polymorphs is the right way. I'm just not sure I'm doing it as effectively as I should be.
UPDATE 3
Thinking mu has the right idea, and that this is one of those times where the query is so complex that it requires find_by_sql, I created a 'Content' model and made the following call...
#recent_content = Content.find_by_sql("SELECT * FROM articles UNION SELECT * FROM reviews ORDER BY published_at DESC")
And it sort of works...it is indeed returning all types with a single query. The remaining trouble is that because I'm calling ActiveRecord on Content, the following doesn't work in my collection partial that renders recent content...
<%= link_to content.title, content %>
Namely the link path: "content_path doesn't exist." It's tricky because each result comes back as a "Content" object, rather than whatever kind of object it really is. That's because I'm starting the above query with 'Content', of course, but I have to put "something" there to make the ActiveRecord query work. I have to figure out how to return the appropriate path/object type for the different content types, /articles/, /reviews/, etc...
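One way to recover the real type (a sketch, assuming both tables share the id, title, teaser, and published_at columns) is to select only the common columns and add a literal discriminator column to each arm of the UNION; the view can then branch on content.record_type to call article_path or review_path:

    SELECT id, title, teaser, published_at, 'Article' AS record_type
    FROM articles
    UNION ALL
    SELECT id, title, teaser, published_at, 'Review' AS record_type
    FROM reviews
    ORDER BY published_at DESC;

UNION ALL also skips the duplicate-elimination pass that a plain UNION performs, which rows from two distinct tables don't need.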

Querying them separately is not as bad as you think. If they have different columns and you need to show those different columns in your views, I don't think you can combine them using a SQL UNION. For performance issues, you can use caching or optimize your tables.

I recently forked a promising project to implement multiple table inheritance and class inheritance in Rails. I have spent a few days subjecting it to rapid development, fixes, commenting and documentation and have re-released it as CITIER (Class Inheritance and Table Inheritance Embeddings for Rails).
I think it would do exactly what you need.
Consider giving it a look: http://peterhamilton.github.com/citier
I am finding it so useful! I would (by the way) welcome any help from the community with issues, testing, code cleanup, etc.! I know this is something many people would appreciate.
Please make sure you update regularly, however, because like I said, it has been improving/evolving by the day.

Related

Are SQL views the proper solution?

I have a table named 'Customers' which has all the information about users of different types (users, drivers, admins), and I cannot separate this table right now because it's live in production and this is not a proper time to do so.
So say I make 3 views: the first has only users, the second drivers, and the third admins.
My goal is to use 3 models instead of one in the project I'm working on.
Is this a good solution, and what does it cost in performance?
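For reference, a minimal sketch of those three views, assuming the discriminator column is named type and holds the values 'user', 'driver', and 'admin':

    CREATE VIEW view_users AS
      SELECT * FROM Customers WHERE `type` = 'user';

    CREATE VIEW view_drivers AS
      SELECT * FROM Customers WHERE `type` = 'driver';

    CREATE VIEW view_admins AS
      SELECT * FROM Customers WHERE `type` = 'admin';

Each model can then be pointed at its view as if it were a table.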
How big is your table 'Customers'? Going by the name, it doesn't sound like a heavy one.
How often these views will be queried?
Do you have any indexes or PK constraints on the attribute you're going to use in the WHERE clause of the views?
"I cannot separate this table right now because it's working on production and this is not a proper time to do this."
From what you said, it sounds like a temporary solution, so it's probably a good one. Later you can replace the views with three tables and it will not affect the interface.
I suggest that it is improper to give end users logins directly into the database. Instead, all requests should go through a database-access layer (API) that users must log into. This layer can provide the filtering you require without (perhaps) any impact on the user. The layer would, while constructing the needed SELECT, tack on, for example, AND type = 'admin' to achieve the goal.
For performance, you might also need to have type at the beginning of some of the INDEXes.
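A hypothetical example of such an index (created_at is an assumed second column; use whatever you actually filter or sort on after type):

    -- With `type` leading, the "WHERE type = ..." filter behind each view
    -- can use the index before narrowing further by created_at:
    ALTER TABLE Customers ADD INDEX idx_type_created (`type`, created_at);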

MySQL JOIN vs LIKE - faster selects?

Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable attributes like length, actors, directors, writers, genres, etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that for each movie, all the above-mentioned parameters are stored comma-separated. Therefore I have so far used a query that matches all these things using LIKE statements. The query can become quite large, as I pretty much query for every parameter in the table, sometimes with 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I started my first job and to teach myself in my freetime, I want to work on my own website. Because I have no real concept for what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else. Which essentially means that having the same name will be treated as the same actor.
Now here comes my actual question:
Assuming I want my select queries to operate faster, would it make sense to run a script that splits the comma divided strings into extra tables (these are n to m relations, see attempt below) and then JOIN all these tables (they will be 8 or more) or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster select queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind, I would still have to create tables for the relation between movies and each of these tables. After doing that, I could remove the columns in the movie table and would end up having to join a lot of tables with EACH query. The only real advantage I can see here is that it would be easier to create an index on individual tables, rather than one (or a few) covering the one big movie table.
I hope all of this even makes sense to you. I appreciate any answer short or long, like I said this is mostly for self studies and as such, I don't have/need a real business model.
I don't understand what you currently have. It seems that you only showed the size of the tables but not their internal structure. You need to separate the data into separate tables using normalization rules and then add the correct indexes; indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE on your queries? Please post the query itself; I cannot guess it from the result. There are also a lot of optimization videos on YouTube.
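To make the normalization concrete, here is a minimal sketch of the n:m split for actors (table and column names are assumptions; directors, writers, and genres would follow the same pattern):

    CREATE TABLE actors (
      actor_id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      UNIQUE KEY uk_actor_name (name)       -- same name = same actor
    );

    CREATE TABLE movie_actors (
      movie_id INT NOT NULL,
      actor_id INT NOT NULL,
      PRIMARY KEY (movie_id, actor_id),
      KEY idx_actor (actor_id)              -- lookups by actor use the index
    );

    -- Indexed JOIN instead of LIKE '%...%', which cannot use an index:
    SELECT m.*
    FROM movies m
    JOIN movie_actors ma ON ma.movie_id = m.movie_id
    JOIN actors a ON a.actor_id = ma.actor_id
    WHERE a.name = 'Harrison Ford';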

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins, and, as exactly one result should be displayed per ID, it also works with rather complex subqueries and GROUP BY clauses. It also gets sorted according to a user-set sort method, and there's pagination in play as well, done with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
1. Run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" and get a list of IDs.
2. Then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins), using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query.
Note: Obviously, the IN () could actually contain the first query in full instead of relying on PHP to build up a comma-separated list.
How bad will the IN be performance-wise? And how much will it possibly hurt me that I cannot LIMIT the first query at all? I'm also wondering if this is a common approach or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
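For illustration, a sketch of the two-step idea with the first query inlined as a subquery (items, item_tags, and users are hypothetical stand-ins for the real schema):

    -- Step 1 (the filter-only query) becomes the subquery; step 2 joins
    -- only the display tables and can still sort and paginate:
    SELECT i.item_id, i.title, u.name AS author
    FROM items i
    JOIN users u ON u.user_id = i.user_id
    WHERE i.item_id IN (
      SELECT t.item_id
      FROM item_tags t
      WHERE t.tag = 'mysql'
    )
    ORDER BY i.created_at DESC
    LIMIT 0, 20;

Note that older MySQL versions execute IN (subquery) as a dependent subquery, re-running it per outer row, which is one reason the IN approach can turn out slower in practice.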
My experience has shown that using the WHERE IN(...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce down the simple main table, then join onto that. Make sure your most complex joins are saved to the end to minimize the rows required to search. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINS where possible.
But I agree with Andomar: if you have the time, build both and measure.

Implementing a database structure for generic objects

I'm building a PHP/MySQL website and I'm currently working on my database design. I do have some database and MySQL experience, but I've never structured a database from scratch for a real-world application which hopefully is going to get some good traffic, so I'd love to hear advice from people who've already done it, in order to avoid common mistakes. I hope my explanations are not too confusing.
What I need
In my application, the user should be able to write a post (title + text), then create an "object" (which can be anything, like a video, or a song, etc.) and attach it to the post. The site has a list of predefined object types the user can create, and I should be able to add new types in the future. The user should also have the ability to see the object's details in a dedicated page and add a comment to it - the same applies to posts.
What I tried
I created an objects table with these fields: oid, type, name and date. This table contains records for anything the user should be able to add comments to (i.e. posts and objects). Then I created a postmeta table which contains additional post data (such as text, author, last edit date, etc.), a videometa table for data about the "video" object (URL, description, etc.), and so on. A postobject table (pid,oid) links objects to posts. Additionally, there's a comments table which contains the comment text, the author and the ID of the object it refers to.
Since the list of object types is predefined and is probably not going to change (though I still need the ability to add a type easily at any time without changing the app's code structure or the database design), and it is relatively small, it's not a problem to create a "meta" table for each type and make a corresponding PHP class in my application to handle it.
Finally, a page on the site needs to show a list of all the posts, including the objects attached to them, sorted by date. So I get all the records from the objects table with type "post" and join them with postmeta to get the post metadata. Then I query postobject to get all the objects attached to each post, and comments to get all the comments.
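In MySQL terms, the structure described above looks roughly like this (column details beyond those named in the question are assumptions):

    CREATE TABLE objects (
      oid INT AUTO_INCREMENT PRIMARY KEY,
      `type` VARCHAR(32) NOT NULL,       -- 'post', 'video', 'song', ...
      name VARCHAR(255) NOT NULL,
      `date` DATETIME NOT NULL
    );

    CREATE TABLE postmeta (
      oid INT PRIMARY KEY,               -- FK to objects.oid
      body MEDIUMTEXT,
      author_id INT,
      last_edit DATETIME
    );

    CREATE TABLE postobject (
      pid INT NOT NULL,                  -- the post's oid
      oid INT NOT NULL,                  -- the attached object's oid
      PRIMARY KEY (pid, oid)
    );

    -- The date-sorted post listing:
    SELECT o.oid, o.name, o.`date`, pm.body
    FROM objects o
    JOIN postmeta pm ON pm.oid = o.oid
    WHERE o.`type` = 'post'
    ORDER BY o.`date` DESC;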
The questions
Does this make any sense? Is it good to design a database this way for a real-world site? I need to join quite a few tables to get all the data I need, and the objects table is going to become huge, since it contains almost every item (only the type, name, and creation date, though). This is to keep the database and the app code flexible, but does it work in the real world, or is it too expensive in the long term? Am I thinking about it the wrong way with this kind of OOP approach?
More specifically: suppose I need to list all the posts, including their attached objects and metadata. I would need to join at least these tables: posts, postmeta, postobject, and {$objecttype}meta (not to mention a users table to get all posts by a specific user, for example). Would I get poor performance doing this, even if I'm using only numeric indexes?
Also, I considered using a NoSQL database (MongoDB) for this project (thanks to Stuart Ellis's advice). Apparently it seems much more suitable, since I need some flexibility here. But my doubt is: the metadata for my objects includes a lot of references to other records in the database. So how would I avoid data duplication if I can't use JOINs? Should I use DBRefs and the techniques described here? How do they compare to the MySQL JOINs used in the structure described above in terms of performance?
I hope these questions make sense. This is my first project of this kind, and I just want to avoid making huge mistakes before I launch it and then finding out I need to rework the design completely.
I'm not a NoSQL person, but I wonder whether this particular case might actually be handled best with a document database (MongoDB or CouchDB). Various types of objects with metadata attached sounds like the kind of scenario MongoDB is designed for.
FWIW, you've got a couple of issues with your table and field naming that might bite you later. For example, type and date are rather generic, and also reserved words. You've also mixed singular and plural table names, which will throw off any automatic object mapping.
Whichever database you use, it's a good idea to find an existing set of database naming conventions and apply it from the start - this will help you avoid subtle issues and ensure that your naming stays consistent. I tend to use the Rails naming conventions ATM, because they are well-known and fairly sensible.
Or you could store the object contents as a file, outside of the database, if you're concerned about the database space.
If you store anything in the database, you already have the object type in objects; so you could just add an object_contents table with a long binary field to store the object. You don't need to create a new table for each new type.
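A minimal sketch of that idea (the column names are assumptions):

    CREATE TABLE object_contents (
      oid INT PRIMARY KEY,          -- FK to objects.oid; the type already lives in objects
      contents LONGBLOB NOT NULL    -- the serialized object payload
    );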
I've seen a lot of JOINs in real-world web applications (5 to 10). The objects table may get large, but that's what indexes are for. So far, I don't see anything wrong with your database. BTW, one thing that felt strange to me: one post, one object, and separate comments for each? No ability to mix pictures with text?

Favouriting things in a database - most efficient method of keeping track?

I'm working on a forum-like webapp where I'd like to allow users to favourite an item so that they can keep track of it, and also so that others can see how many times an item's been favourited.
The problem is, I'm unsure on the best practices for databases, which includes this situation.
I have two ideas in my head on how to do this:
Add an extra column to the user table and store things like so: "|2|5|73|"
Add an extra table with at least two columns, one for referencing an item, the other for referencing a user.
I feel uncomfortable about going for the second method, as it involves an extra table, and potentially more queries would be required. Perhaps these concerns aren't an issue, as I have little understanding of databases beyond simply working with table layouts and basic queries.
The second method, commonly called a junction or join table, is fairly standard practice and is going to be far more efficient than adding a column like the one you describe to the user table. Through the magic of JOINs you won't be making any extra queries.
Since it sounds like your app is starting to get a little complex, I highly recommend picking up a MySQL database book at your local library or book store (check for reviews on Amazon to find a good one) and expanding your knowledge.
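A sketch of such a junction table and the two typical queries against it (table and column names are hypothetical):

    CREATE TABLE favourites (
      user_id INT NOT NULL,
      item_id INT NOT NULL,
      PRIMARY KEY (user_id, item_id),   -- one favourite per user per item
      KEY idx_item (item_id)            -- fast per-item counts
    );

    -- How many times has item 73 been favourited?
    SELECT COUNT(*) FROM favourites WHERE item_id = 73;

    -- Everything a user has favourited, in a single query via a JOIN:
    SELECT i.*
    FROM items i
    JOIN favourites f ON f.item_id = i.item_id
    WHERE f.user_id = 2;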
Well, I'd have +1-ed the other response, but I'm too much of a newb apparently. But yes, I recommend a join table for this type of thing.