SQL Database design for statistical analysis of many-to-many relationship - mysql

It's my first time working with databases so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.
I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:
Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.
The Statistics table is what I'm looking to get to in the end. In the image are just some random things I may want to calculate.
So my questions are:
Does this design make sense? What improvements might you make?
What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?
Should I possibly be using python or another language to get these statistics instead of SQL? My understanding was that SQL has the potential to cut down a few hundred lines of python code to one line, so I thought I'd try to give it a shot with SQL.
Thanks!

I think your design is fine, though Race_Results.Age is redundant: it can be derived from the runner's date of birth and the race date, and it will go stale if you ever update either one.
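For example, assuming Runners stores a date of birth and Races stores the race date (column names like DOB and Race_Date here are my guesses, not part of your posted design), the age at race time can be computed on the fly:
-- Derived age instead of a stored Race_Results.Age column
SELECT rr.Race_ID, rr.Runner_ID,
       TIMESTAMPDIFF(YEAR, r.DOB, ra.Race_Date) AS Age
FROM Race_Results rr
JOIN Runners r ON r.Runner_ID = rr.Runner_ID
JOIN Races ra ON ra.Race_ID = rr.Race_ID;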
It should be reasonably easy to create views for each of your statistics. For example:
CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;
CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
ON rr.Race_ID = b.Race_ID AND rr.Time <= DATE_ADD(b.Time, INTERVAL 10 MINUTE);
SELECT
rr.Runner_ID,
COUNT(*) AS Number_of_races,
COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID;

1) The design of your 3 tables Races, Race_Results and Runners makes perfect sense. Nothing to improve here. The statistics are something different. If you manage to write those probably slightly complicated queries in a way that they can be used in a view, you should do that and avoid saving statistics that need to be recalculated each day. Calculating something like this on the fly whenever it is needed is better than saving it, as long as the performance is sufficient.
2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL, you will have to use GROUP BY and subqueries. That makes the whole approach a bit more complicated, but it is totally feasible.
If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
3) If you can, put your code in the database. That way you avoid frequent context switches between your programming language and the database. This approach is usually the fastest in any database system.
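To make point 1 concrete, here is a minimal sketch of a per-runner statistics view in plain MySQL, using only GROUP BY; the view name is my own, and it assumes Time is stored as a TIME column:
CREATE VIEW Runner_Statistics AS
SELECT Runner_ID,
       COUNT(*) AS Races_Run,
       MIN(Time) AS Best_Time,
       SEC_TO_TIME(AVG(TIME_TO_SEC(Time))) AS Average_Time  -- average finish time
FROM Race_Results
GROUP BY Runner_ID;
Because it is a view, nothing needs to be stored in a Statistics table or recalculated on a schedule; it is computed whenever it is queried.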

Related

MYSQL DB Normalization & Query Indexes

We currently have a table that contains 90 columns, and as the table grows and the business needs change, we're having to alter the table a lot (add/remove columns & indexes).
Table name: quotes

| Column             | Type     | Null | Default |
|---------------------|----------|------|---------|
| id                  | int(11)  | No   |         |
| ...                 |          |      |         |
| completed_at        | datetime | Yes  | NULL    |
| reviewed_at         | datetime | Yes  | NULL    |
| marked_dud_at       | datetime | Yes  | NULL    |
| closed_at           | datetime | Yes  | NULL    |
| subscribed_at       | datetime | Yes  | NULL    |
| admin_checked_at    | datetime | Yes  | NULL    |
| priced_at           | datetime | Yes  | NULL    |
| number_verified_at  | datetime | Yes  | NULL    |
| created_at          | datetime | Yes  | NULL    |
| deleted_at          | datetime | Yes  | NULL    |
For the application, our staff are constantly querying all sorts of variations on the above data, for example quotes that have been completed (completed_at), checked (admin_checked_at), not deleted (deleted_at) and reviewed (reviewed_at).
We're thinking it may be easier to offload some of these columns into rows of their own table, which we'll call quotes_actions, and then do some joining when querying.
Table name: quotes_actions

| Column     | Type         | Null | Default |
|------------|--------------|------|---------|
| id         | int(11)      | No   |         |
| quote_id   | int(11)      | No   |         |
| action     | varchar(100) | No   |         |
| user_id    | int(11)      | No   |         |
| time       | datetime     | Yes  | NULL    |
| created_at | datetime     | Yes  | NULL    |
An example would be a row with action = 'completed', with an index covering quote_id and action.
We've split the data into this format for 150,000 rows and it's neither faster nor slower than querying the original table with the correct indexes.
Has anyone got any experience with this and has any recommendations or pitfalls for each approach? It's taking a lot of time to add covering indexes and add columns to the original table as we needed them, whereas the second approach has the indexes set up ready to go but is introducing a lot more joins and more complicated queries.
0.09s
select * from `quotes`
where `completed_at` is not null
and `approved_at` is not null
and deleted_at is null
=>
0.0005s
select * from `quotes_new`
inner join quotes_actions as q1 on q1.action = 'completed' and q1.quote_id = quotes_new.id
inner join quotes_actions as q2 on q2.action = 'approved' and q2.quote_id = quotes_new.id
where quotes_new.deleted_at is null
In addition, if the 2nd approach is better, how do you query for negative results, where a quote hasn't been approved?
Database design will vary from application to application, and things that are great for one implementation will be terrible for another. You've identified a few things that are important to you:
speed of data access (at least no reduction in current performance)
ability to respond to application needs/changes
limiting complexity of queries
Without being able to see the entirety of your database and how you are using it, these are the principles I would follow:
Use Stored Procedures and Views for as much as possible
This is just good design. You create an adapter layer between your application and the data tables, which allows you to make whatever changes you need to in the database (and the views/stored procs) without having to change the application itself. Decoupling your systems makes maintenance significantly easier. Also this is good for security, as if the only way outsiders can access the data is through your stored procs, you've eliminated a few avenues of attack. (There's also debate about whether or not the DBMS will cache execution plans for stored procedures, making them execute faster than similar queries, but I'm not a DBA or DBDev, so I'm not touching that).
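As a small sketch of that adapter layer applied to the filter from the question (the view name is my own invention):
-- One place to define "completed, checked and not deleted" quotes
CREATE VIEW completed_checked_quotes AS
SELECT *
FROM quotes
WHERE completed_at IS NOT NULL
  AND admin_checked_at IS NOT NULL
  AND deleted_at IS NULL;
The application then queries the view, and the underlying tables can be restructured later without touching application code.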
Attempt to limit width of tables
One thing I've seen time and time again is that every time a need arises in a production system, a column gets added to a table and they call it a day. That is far easier than rewriting a bunch of queries or reviewing table structures, but it is terrible design. If you've already limited the changes needed in the application layer by following my first piece of advice, you've limited the work needed to actually resolve table changes in the right way. You should always evaluate whether data belongs to the row in question, or if it should be offloaded into its own table. You shouldn't be afraid to radically alter your database, as sometimes it is necessary.
Looking at the data you've provided, I think your second option is okay. You've identified many columns that actually represent the same thing (the "status changes" or as you put it "quote actions" that occur) and offloaded that from the main table to a secondary table. This is perfectly fine, and likely will be effective. You can further "cheat" to make this table faster by offloading status onto its own table, and using an integer to represent it instead of a string (since the string doesn't matter to the database, and integers are far faster to index and search).
This is not to say a wide table is a bad thing, sometimes tables just need to be wide. You just need to evaluate whether the data really belongs to the entity the data row represents.
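A sketch of the "cheat" described above, with invented names for the lookup table and columns:
-- Status strings move into a small lookup table...
CREATE TABLE quote_action_types (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE      -- 'completed', 'approved', ...
);

-- ...and quotes_actions stores an integer key instead of a string.
CREATE TABLE quotes_actions (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    quote_id       INT NOT NULL,
    action_type_id INT NOT NULL,
    user_id        INT NOT NULL,
    time           DATETIME NULL,
    created_at     DATETIME NULL,
    KEY idx_quote_action (quote_id, action_type_id),
    FOREIGN KEY (action_type_id) REFERENCES quote_action_types (id)
);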
Approach queries in new ways
You will want to play with the execution plan tools of your DBMS and understand how each query really works. Changing the order of joins can drastically alter the query return speed, and you shouldn't be afraid to use table variables and temp tables in your queries. They are all tools at your disposal.
Querying for Negative Results
Since you asked this question specifically, I'll address it. It requires thinking about your query in a slightly different way (incidentally, if you haven't already, you should look into taking a course or working through a textbook on relational algebra; it makes understanding databases so much easier).
Your original query made finding something where the quote was not approved easy. It was all in the table: approved_at is null. Simple, easy peasy, no problems. Now, however, instead of being in a column on the main table, it is in its own table, that also represents all the other actions that could be taken. You need to break the problem down a little.
You want to find the set wherein of all orders, there is no action to signify it is approved. In SQL that looks like:
select quote_id from quotes_actions where quote_id not in
(select quote_id from quotes_actions where action = 'approved');
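A possible variation (my addition, not part of the original answer): starting from quotes instead of quotes_actions also catches quotes that have no actions at all, and the anti-join form often optimizes better than NOT IN in MySQL:
-- Quotes that have no 'approved' action, including quotes with no actions yet
SELECT q.id
FROM quotes q
LEFT JOIN quotes_actions qa
       ON qa.quote_id = q.id AND qa.action = 'approved'
WHERE qa.id IS NULL;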
Final Thoughts
You need to sit down with your team and talk about how you want to move forward with this product. Spend a few days or a couple weeks really thinking deeply about it. Brainstorm....hackathon....do something to find a solution you like and makes your product better and more maintainable. We've all been in the situation where we have an unmaintainable product that could have been fixed at some point, but is beyond that point. Try not to get to that point, and fix it while you have the opportunity.

How to optimize and Fast run SQL query

I have the following SQL query that is taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
The above query takes 230976.6 ms to execute.
I added indexes on firstname, lastname, customer_state and status_id.
How can I execute the query in less than 3 seconds?
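For reference, the ActiveRecord call above produces roughly the following SQL (my transcription; worth running through EXPLAIN):
SELECT first_name, last_name, customer_state, customers.created_at,
       customers.customer_state, customers.id, customers.status_id
FROM customers
LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id
WHERE renewals.customer_id IS NULL
  AND customers.status_id = 4
ORDER BY created_at DESC;
In this anti-join shape, the query mostly benefits from indexes on renewals.customer_id and customers.status_id rather than on the name columns.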
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
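A minimal sketch of the pattern in T-SQL, with made-up table and column names: stage the comparison in a temp table, then run one set-based UPDATE against the live table.
-- Stage the rows and the computed values in a temp table
SELECT o.OrderID,
       CASE WHEN o.Total > 1000 THEN 'HIGH' ELSE 'NORMAL' END AS NewPriority
INTO #OrderUpdates
FROM dbo.Orders AS o
WHERE o.Status = 'OPEN';

-- One short UPDATE against the live table instead of a row-by-row cursor
UPDATE o
SET    o.Priority = u.NewPriority
FROM   dbo.Orders AS o
JOIN   #OrderUpdates AS u ON u.OrderID = o.OrderID;

DROP TABLE #OrderUpdates;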
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
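A minimal sketch of the idea in T-SQL, with invented table and function names:
-- Inline table-valued function: total order amount for one customer
CREATE FUNCTION dbo.fn_CustomerTotal (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(o.Amount) AS Total
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID
);
GO

-- Instead of a scalar function called once per row in the SELECT list,
-- CROSS APPLY the inline TVF so the optimizer can fold it into the plan.
SELECT c.CustomerID, c.Name, t.Total
FROM dbo.Customers AS c
CROSS APPLY dbo.fn_CustomerTotal(c.CustomerID) AS t;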
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
This is a case where understanding that all tables are partitions slashed hours from a data load.
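A hedged sketch of that metadata-only archive step in T-SQL (table names are invented; the two tables must have matching schemas and indexes, live on the same filegroup, and the target must be empty):
-- Assign the current day's pages to the archive table without rewriting rows
ALTER TABLE dbo.CurrentDayLoad
SWITCH TO dbo.CurrentDayArchive;

-- If the load fails, switch the data straight back
ALTER TABLE dbo.CurrentDayArchive
SWITCH TO dbo.CurrentDayLoad;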
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
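A rough sketch of the alternative in T-SQL, with hypothetical tables: the update and the follow-up insert run as two short transactions instead of one trigger transaction that holds locks on both tables.
CREATE PROCEDURE dbo.UpdateOrderAndLog
    @OrderID INT,
    @NewStatus VARCHAR(20)
AS
BEGIN
    -- First short transaction: the update itself
    BEGIN TRANSACTION;
        UPDATE dbo.Orders
        SET    Status = @NewStatus
        WHERE  OrderID = @OrderID;
    COMMIT TRANSACTION;

    -- Second short transaction: the insert a trigger would otherwise
    -- perform inside the same transaction as the update
    BEGIN TRANSACTION;
        INSERT INTO dbo.OrderStatusLog (OrderID, Status, ChangedAt)
        VALUES (@OrderID, @NewStatus, GETDATE());
    COMMIT TRANSACTION;
END;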
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
DECLARE @CT INT;
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
    -- <do something>
END
It’s completely unnecessary. If you want to check for existence, then do this:
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
    -- <do something>
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows from sysindexes where object_name(id) = 'T1' and indid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref: http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

Most efficient way to select lots of values from database

I have a database table with around 2400 'items'. A user will have any arbitrary combination of items from the 2400 item set. Each item's details then need to be looked up and displayed to the user. What is the most efficient way to get all of the item details? I can think of three methods:
Select all 2400 items and parse the required details client-side.
Select the specific items with a SELECT which could be a very long SQL string (0-2400 ids)?
Select each item one at a time (too many connections)?
I'm not clued up on SQL efficiency so any guidance would help. It may help to know this is a web app heavily AJAX based.
Edit: On average a user will select ~150 items and very rarely more than 400-500.
The best method is to return the data you want from the database in a single query:
select i.*
from items i
where i.itemID in (<list of ids>);
MySQL queries can be quite large (see here), so I wouldn't worry about being able to pass in the values.
However, if your users have so many items, I would suggest storing them in the database first and then doing a join to get the additional information.
If the users never/rarely select more than ~50 elements, then I agree with Gordon's answer.
If it is really plausible that they might select up to 2400 items, you'll probably be better off by inserting the selected ids into a holding table and then joining with that.
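A minimal sketch of the holding-table approach (the temporary table name is my own; items is assumed to have an itemID key, as in the answer above):
-- Hypothetical holding table for the ids the user selected
CREATE TEMPORARY TABLE selected_items (itemID INT PRIMARY KEY);

INSERT INTO selected_items (itemID) VALUES (17), (42), (105);  -- ...and so on

SELECT i.*
FROM items i
JOIN selected_items s ON s.itemID = i.itemID;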
However, a more thorough answer can be found here - which I found through this answer.
He concludes that:
We see that for a large list of parameters, passing them in a temporary table is much faster than as a constant list, while for small lists performance is almost the same.
'Small' and 'large' are hardly static, but dependent upon your hardware - so you should test. My guess would be that with an average of 150 elements in your IN-list, you will see the temp table win.
(If you do test, please come back here and say what is the fastest in your setup.)
2400 items is nothing. I have a MySQL database with hundreds of thousands of rows and relations, and it works perfectly just by optimizing the queries.
What you must do is see how long the execution time is for each SQL query. You can then optimize each query on its own, trying different queries and measuring the execution time.
You can use e.g. MySQL Workbench, Sequel Pro, Microsoft SQL Server Management Studio or other software for building queries. You can also add indexes to your tables, which can improve queries as well.
If you need to scale your database up you can use software like http://hadoop.apache.org
Another thing worth mentioning is NoSQL (Not Only SQL). These are non-relational databases, which can handle dynamic attributes and are built for handling large amounts of data.
As you mention, you could use AJAX, but that only helps the load time of the page and the stress on your web server (not the SQL server). Just ask if you want more info or a more in-depth explanation.

Rails and queries with complex joins: Can each joined table have an alias?

I'm developing an online application for education research, where I frequently have the need for very complex SQL queries:
queries usually include 5-20 joins, often joined to the same table several times
the SELECT field often ends up being 30-40 lines tall, between derived fields / calculations and CASE statements
extra WHERE conditions are added in the PHP, based on user's permissions & other security settings
the user interface has search & sort controls to add custom clauses to the WHERE / ORDER / HAVING clauses.
Currently this app is built on PHP + MySQL + jQuery for the moving parts. (This grew out of old Dreamweaver code.) Soon we are going to rebuild the application from scratch, with the intent to consolidate, clean, and be ready for future expansion. While I'm comfortable in PHP, I'm learning bits about Rails and realizing that maybe it would be better to build version 2.0 on a more modern framework instead. But before I can commit to hours of tutorials, I need to know whether the Rails querying system (ActiveRecord?) will meet our query needs.
Here's an example of one query challenge I'm concerned about. A query must select from 3+ "instances" of a table, and get comparable information from each instance:
SELECT p1.name AS my_name, pm.name AS mother_name, pf.name AS father_name
FROM people p1
JOIN people pm ON p1.mother_id = pm.id
JOIN people pf ON p1.father_id = pf.id
# etc. etc. etc.
WHERE p1.age BETWEEN 10 AND 16
# (selects this info for 10-200 people)
Or, a similar example, more representative of our challenges. A "raw data" table joins multiple times to a "coding choices" table, each instance of which in turn has to look up the text associated with a key it stores:
SELECT d.*, c1.coder_name AS name_c1, c2.coder_name AS name_c2, c3.coder_name AS name_c3,
(c1.result + c2.result + c3.result) AS result_combined,
m_c1.selection AS selected_c1, m_c2.selection AS selected_c2, m_c3.selection AS selected_c3
FROM t_data d
LEFT JOIN t_codes c1 ON d.id = c1.data_id AND c1.category = 1
LEFT JOIN t_menu_choice m_c1 ON c1.menu_choice = m_c1.id
LEFT JOIN t_codes c2 ON d.id = c2.data_id AND c2.category = 2
LEFT JOIN t_menu_choice m_c2 ON c2.menu_choice = m_c2.id
LEFT JOIN t_codes c3 ON d.id = c3.data_id AND c3.category = 3
LEFT JOIN t_menu_choice m_c3 ON c3.menu_choice = m_c3.id
WHERE d.date_completed BETWEEN ? AND ?
AND c1.coder_id = ?
These sorts of joins are straightforward to write in pure SQL, and when search filters and other varying elements are needed, a couple PHP loops can help to cobble strings together into a complete query. But I haven't seen any Rails / ActiveRecord examples that address this sort of structure. If I'll need to run every query as pure SQL using find_by_sql(""), then maybe using Rails won't be much of an improvement over sticking with the PHP I know.
My question is: Does ActiveRecord support cases where tables need "nicknames", such as in the queries above? Can the primary table have an alias too? (in my examples, "p1" or "d") How much control do I have over what fields are selected in the SELECT statement? Can I create aliases for selected fields? Can I do calculations & select derived fields in the SELECT clause? How about CASE statements?
How about setting WHERE conditions that specify the joined table's alias? Can my WHERE clause include things like (using the top example) " WHERE pm.age BETWEEN p1.age AND 65 "?
This sort of complexity isn't just an occasional bizarre query, it's a constant and central feature of the application (as it's currently structured). My concern is not just whether writing these queries is "possible" within Rails & ActiveRecord; it's whether this sort of need is supported by "the Rails way", because I'll need to be writing a lot of these. So I'm trying to decide whether switching to Rails will cause more trouble than it's worth.
Thanks in advance! - if you have similar experiences with big scary queries in Rails, I'd love to hear your story & how it worked out.
The short answer is yes. Rails takes care of a large part of these requirements through various types of relations, scopes, etc. The most important thing is to properly model your application to support the types of queries and the functionality you are going to need. If something is difficult to explain to a person, it will generally be very hard to do in Rails. It's optimized to handle most "real world" types of relationships and tasks, so "exceptions" become somewhat difficult to fit into this convention, and later become harder to maintain, manage, develop further, decouple, etc. The bottom line is that Rails can handle the SQL query for you (SomeObject.all_active_objects_with_some_quality), give you complete control over the SQL (SomeObject.find_by_sql("select * from ..."), execute("update blah set something=''...")) and everything in between.
One of the advantages of Rails is that it allows you to quickly create prototypes. I would create your model concepts and then test the most complex business requirements that you have. This will give you a quick idea of what is possible and easy to do versus the bottlenecks and potential issues that you might face in development.

Data object storage - Can table JOIN's do what single table SELECT's cannot?

Now that "NOSQL" or "object only" storage systems like MongoDB or memcached are really picking up steam in the world. I was wondering if there are any requests that cannot be performed on them that can be performed using multiple object joins (in SQL that is JOIN "table"). In other words, are there any multi-table queries that cannot be handled by several single table queries in a row?
Basically, is there a use-case were a multi-table join cannot be replicated by accessing one table at a time in object based storage systems?
Here are some examples of normal 3NF queries using has_many and has_many_through relations. These aren't the most complex queries, but they should give you a starting point for the concept. Note that any value in {} means a value taken from the result of the previous query.
Company Has Many Users
SELECT user.*, company.name as company_name FROM user
LEFT JOIN company ON company.id = user.company_id
WHERE user.id = 4
vs
SELECT * FROM user WHERE id = 4
SELECT * FROM company WHERE id = {user.company_id}
Club Has Many Students Through Memberships
SELECT student.* FROM student LEFT JOIN membership ON
membership.student_id = student.id WHERE membership.club_id = 5
vs
SELECT * FROM membership WHERE club_id = 5
SELECT * FROM student WHERE id = {membership.student_id}
The reason I'm wondering is because I want to know if Object-based systems (that rely on accessing single table objects at a time) can do what RDBMS databases like PostgreSQL or MySQL can do.
So far the only thing wrong seems to be that more queries are necessary.
Just because you can, doesn't mean you should.
The multiple SELECT statement alternative cons:
the fewer trips to the database, the better. TCP overhead cannot be recouped, and it looks like Network Neutrality is officially dead, so we could expect to see a movement away from multi-select/NoSQL because you might have to pay for that bandwidth...
because of the delay between the initial and subsequent statements, there is a risk that the supporting data no longer reflects what was in the system when the first query was run
less scalable: the larger the data set, the more work the application is doing to deal with business rules and associations that scale far better in a database
more complexity in the application, which also makes the business logic less portable (i.e. migrating from Java to .NET or vice versa means building from scratch, where business logic in the DB would minimize that)
1 - running multiple separate queries leaves you with a concurrency mess: by the time you got something from table 1 it could have been deleted, yet it might still be in table 2 - now assume 5 correlated tables.
2 - running queries with at least moderately complex logic over fields that are not mythical ID
3 - controlling the amount of data fetched (you hardly ever need more than 50% of the data that gets pulled just to deserialize/create valid objects, and even worse, whole trees of connected objects)
4 - correlated queries (nested selects), which a SQL server will optimize like joins down to additive complexity or better (|T1|+|T2|+|T3|+|T4|), while any ORM or NoSQL store has to keep repeating the inner queries, giving rise to multiplicative complexity (|T1|*|T2|*|T3|*|T4|)
5 - dataset sizes, scalability not just in dataset sizes but also in handling concurrency under updates. Even ORM-s which maintain transactions make them so long that chances for deadlocks increase exponentially.
6 - blind updates (a lot more data touched for no reason) and their dependency and failure based on a blind instrument (the mythical version column, which is realistically needed in maybe 1% of a relational data model, but ORMs and the like have to have it everywhere)
7 - lack of any standards and compatibility - this means that your system and data will always be at much higher risk and dependent on software changes driven by academic adventurism rather than any actual business responsibility, and with the expectation to invest a lot of resources just in testing changes.
8 - data integrity - oops, some code just deleted half of today's order records from T1 since there was no foreign key to T2 to stop it. Perfectly normal thing to do with separated queries.
9 - negative maturity trend - keeps splintering instead of standardizing - give it 20 yr and maybe it will get stable
Last but not least - it doesn't reduce any complexity (the same correlation between data is still there), but it makes it very hard to track and manage that complexity or have any realistic remedy or transparency when something goes wrong. And it adds the complexity of 1-2 layers. If something goes wrong in your SQL tables, you have tools and queries to discover and even fix your data. What are you going to do when some ORM just tells you that it has an "invalid pointer" and throws an exception, since surely you don't want an "invalid object"?
I think that's enough :-)
Actually one of the biggest problems is that some of the NoSQL databases are not transactional across multiple queries.
ORMs like Hibernate will sometimes do multiple queries without "joining", but they have the advantage that those queries are within the same transaction.
With NoSQL you do not have that luxury.
So this could very easily have misleading results:
SELECT * FROM user WHERE id = 4
SELECT * FROM company WHERE id = {user.company_id}
The results will be misleading if the company referenced by user.company_id is deleted between the two statement calls. This is a well-known issue with these databases. So regardless of whether or not you can properly replicate JOINs, the issue will be not having transactions.
Otherwise you can model anything so long as it can store bytes :)
You could nosql like an old fashioned 'hierarchical' database too!
In addition to OMGPonies' answers, reporting is harder to do.
About scaling - that's not right. nosql is meant for scaling, if you use it right.
Another reason to do nosql - if you are doing all your work in objects, going thru o-r mapping to sql, and no work thru complicated (i.e., hand-rolled for efficiency) UPDATE statements. e.g., an update of a join, or update 'where ... in (...)'.
If the database is single-purpose (usu. the case for high volume apps) nosql is more likely to be OK.
Multipurpose - OLTP - Line of Business - go with SQL.
I could go on but this is eating into my lunch break. Not that I would ever eat into work-time. I prefer to just eat during my lunch break.