Using Linq-to-SQL preload without joins - linq-to-sql

Like many people, I am trying to squeeze the best performance out of my app while keeping the code as simple and readable as possible. I am using Linq-to-SQL and am really trying to keep my data layer as declarative as possible.
I operate on the assumption that SQL calls are the most expensive operations, so I try to minimize their number while avoiding crazy complex queries that are hard to optimize.
Case in point: I am using DataLoadOptions with my DataContext -- its goal is to minimize the number of queries by preloading related entities. (Aka eager loading vs. lazy loading.)
Problem: Linq uses joins to achieve the goal. As with everything, it's a trade-off. I am getting fewer queries, but those joined queries are more complex and expensive. Going into SQL Profiler makes this clear.
So, I'd like an option in Linq to preload without joins. Is this possible? Here's what it might look like:
I have a Persons table, an Items table, and a PersonItems table to provide a many-to-many relationship. When loading a collection of Persons, I'd like to have all their PersonItems and Items eagerly loaded as well.
Linq currently does this with one large query, containing two joins. What I'd rather it do is three non-join queries: one for Persons, one for all the PersonItems relating to those Persons, and one for all Items relating to those PersonItems. Then Linq would automagically arrange them into the related entities.
Each of these would be a fast, firehose-type query. Over the long haul, it would allow for predictable, web-scale performance.
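Here's roughly what I mean at the SQL level; the key column names (PersonId, ItemId) are just illustrative, since the exact mapping depends on the schema:
-- today: one big query with two joins
SELECT p.*, pi.*, i.*
FROM Persons p
LEFT JOIN PersonItems pi ON pi.PersonId = p.PersonId
LEFT JOIN Items i ON i.ItemId = pi.ItemId
-- what I'd prefer: three flat, join-free queries, stitched together in memory
SELECT * FROM Persons
SELECT * FROM PersonItems WHERE PersonId IN (/* PersonIds from query 1 */)
SELECT * FROM Items WHERE ItemId IN (/* ItemIds from query 2 */)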
Ever seen it done?

I believe the three non-join queries you describe are essentially what happens under the hood when a single join query is performed. I could be wrong, but if that is the case, the single query will be more efficient, since only one database round trip is involved instead of three. If you are having performance issues, I'd first make sure the columns you are joining on are indexed (you should see no table scans in SQL Profiler). If that is not enough, you could write a custom stored procedure to get just the data you need (assuming you don't need every column of every object, this will let you use index seeks, which are faster than index scans), or alternatively you could denormalise (duplicate data across your tables) so that no joining occurs at all.
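For the schema in the question, that would mean something along these lines (the index and column names are assumed, not taken from your actual database):
CREATE INDEX IX_PersonItems_PersonId ON PersonItems (PersonId);
CREATE INDEX IX_PersonItems_ItemId ON PersonItems (ItemId);
With both columns of the junction table indexed, the joins should show up as index seeks rather than table scans in SQL Profiler.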

Related

What is the optimal solution, use Inner Join or multiple queries?
something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (then, for each row returned by the first query) ...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]
Generally speaking, one query is better, but there are some caveats. For example, older versions of SQL Server suffered a large drop in performance if you did more than seven joins. The answer really depends on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries where possible, without going overboard and creating result sets that are crazy or impossible to maintain.
This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception, the optimum is to issue as few commands as possible and pull out all the data you'll need. However, for practical reasons this clearly may not be possible.
Generally speaking, if a database is well maintained, one query is quicker than two. If it's not, you need to look at your data/indices and determine why.
A final point, you're hinting in your second example that you'd load the brand then issue a command to get all the cars in each brand. This is without a doubt your worst option as it doesn't issue 2 commands - it issues N+1 where N is the number of brands you have... 100 brands is 101 DB hits!
Your two queries are not exactly the same.
The first returns all fields from brands and cars in one row. The second returns two different result sets that need to be combined together.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where a single query ends up returning more data than multiple queries would. For instance, in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster: you only bring back the columns from the brands table once rather than 10,000 times.
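As a rough sketch (the car columns are made up for the example):
-- two-query method: the 100 brand columns come back once
SELECT * FROM brands WHERE id = 42;
SELECT brand_id, model, year FROM cars WHERE brand_id = 42;
-- join method: those same 100 brand columns are repeated on all 10,000 car rows
SELECT * FROM brands b INNER JOIN cars c ON c.brand_id = b.id WHERE b.id = 42;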
These cases where multiple queries are better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases you might be able to break up queries and improve performance.
In general, use the first query. Why? Because query execution time is not just the time of the query itself, but also includes overheads such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending on the situation, some of these overheads may or may not be present. For example, if you're using a persistent connection you won't pay the connection overhead, but in the common case you will. And the creating/maintaining/closing connection overhead is a very significant part of the cost. Imagine that this overhead is only 1% of total query time (in a real situation it will be much more), and that you have, let's say, 1,000,000 rows. Then the first query pays that overhead only once, while the second approach pays it about 1,000,000/100 = 10,000 times (assuming around 100 cars per brand, that's one query per brand). Just think about how slow that will be.
Besides, the INNER JOIN will also be done using the key, if it exists, so in terms of the query's own speed it will be nearly the same as the second option. So I highly recommend using the INNER JOIN option.
Breaking a complex query into simple queries can be useful in some very specific cases. For example, the case of an IN subquery. In this situation, if you're using WHERE id IN (subquery), where (subquery) is some SQL, MySQL treats that as an = ANY subquery and will not use a key for it, even if the subquery results in a narrow list of ids. And yes, splitting it into two queries can make sense, because WHERE IN (static list) works differently: MySQL will use a range index scan for it (strange, but true, because for an IN (static list) statement the IN is treated as a comparison operator, not as an = ANY subquery qualifier). This part isn't directly about your case, but it shows that cases where splitting processing away from the DBMS can help performance do exist.
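To illustrate with the brands/cars tables from the question (the country column is made up):
-- subquery form: treated as = ANY, the key on brand_id is not used
SELECT * FROM cars
WHERE brand_id IN (SELECT id FROM brands WHERE country = 'DE');
-- static-list form, with the list built by the application: range index scan on brand_id
SELECT * FROM cars
WHERE brand_id IN (3, 7, 19);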
One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins and, since exactly one result should be displayed per ID, it also relies on rather complex subqueries and GROUP BY clauses. The results are sorted according to a user-selected sort method, and pagination is handled with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins) using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query.
Note: Obviously the IN () could actually contain the first query in full instead of relying on PHP to build up a comma-separated list.
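To make the idea concrete, a rough sketch; apart from item_id, every table and column name here is made up:
-- query 1: filter only, no display joins, no GROUP BY
SELECT DISTINCT i.item_id
FROM items i
JOIN item_attributes a ON a.item_id = i.item_id
WHERE a.value = 'search term';
-- query 2: gather display data only for the matching IDs, with sorting and pagination
SELECT i.item_id, i.name, p.name AS parent_name
FROM items i
LEFT JOIN items p ON p.item_id = i.parent_id
WHERE i.item_id IN (3, 17, 42)
ORDER BY i.name
LIMIT 0, 20;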
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there, where I need to compare the search parameter not only against the item's own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that using the WHERE IN(...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce down the simple main table, then join onto that. Make sure your most complex joins are saved to the end to minimize the rows required to search. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINS where possible.
But I agree with Andomar, if you have the time build both and measure.

Joining Tables Based on Foreign Keys

I have a table that has a lot of fields that are foreign keys referencing a related table. I am writing a script in PHP that will do the db queries. When I query this table for its data I need to know the values associated with these keys not the key.
How do most people go about this?
A 101 way to do this would be to query this table for its data including the foreign keys and then query the related tables to get each key's value. This could be a lot of queries (~10).
Question 1: I think I could write 1 query with a bunch of joins. Would that be better?
This approach also requires the querying script to know which table fields are foreign keys. Since I have many tables like this, all with different fields, that makes it hard to write nice generic functions. MySQL InnoDB tables allow for foreign key constraints, and I know the database has these set up correctly.
Question 2: What about the idea of querying the table to identify what the constraints are, and then matching them up using whatever process I decide on from Question 1? I like this idea but never see it being used in code, which makes me think it's not a good idea for some reason. I would use something like SHOW CREATE TABLE tbl_name; to find what constraints/relationships exist for that table.
Thank you for any suggestions or advice.
You talk about writing "nice generic functions", but I think you are thinking a little TOO generic here.
Personally I would just write a query with a bunch of joins in it. If you want to abstract all that join logic away and not have to worry about it, then you should probably look at using an ORM instead of writing the SQL directly.
At some level, the system should run queries using joins, whether those queries are written explicitly by the application programmer or generated automatically by the data access layer. Option 1 is definitely better than the naive option. As for some other query creation options (by no means an exhaustive list):
You could abstract out all database operations, much as PDO abstracts out connecting and query operations (i.e. preparing & executing queries). Use this to get table metadata, including foreign keys, which could then be used to construct queries automatically.
You could write object specifications in some other format (e.g. XML) and a class that would use that to both generate PHP classes and database tables. You find this more in Enterprise applications than smaller projects. This option has more overhead than others, and thus isn't suitable if you only have a few classes to model. Occurrences of this option might also be a consequence of Conway's Law, which I first heard as Richard Fairly's variant: "The structure of a system reflects the structure of the organization that built it."
You could take a LINQ-like approach. In PHP, this would mean writing numerous functions or methods that the application programmer can chain together which would create a query. The application programmers are ultimately responsible for joining tables, though they never write a JOIN themselves.
Rather than thinking about how to create the queries, a better problem approach is to think about how to interface the DB and the application. This leads to patterns such as Data Mapper and Active Record that fall into the category of Object-Relational mapping (ORM). Note that some patterns (such as AR), other ORM techniques and even ORM itself have issues of their own. Any of the above query creation options can be used in the implementation of a data access pattern.
The problem with using SHOW CREATE TABLE is it doesn't work with most (all?) other RDBMSs. If you want to marry your app to MySQL, go ahead, but the decision could haunt you.
What kind of record counts are you working with, both in the main data table(s) and the lookup tables?
As a general rule, you should join the lookup tables to the main table. If you have an excessive amount of joins and there aren't many UDFs involved here, there's a pretty good chance the table should be normalized a bit more. If the normalization is fine and the main data table is really wide, you could still split the table to multiple tables with 1:1 relationships so as to separate the frequently accessed columns from the infrequently accessed columns.
MySQL includes support for the ANSI catalog INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS. You could use that to gather information on the FK relationships that exist.
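For example, something like this lists a table's foreign key columns and what they reference (the schema and table names are placeholders):
SELECT kcu.TABLE_NAME, kcu.COLUMN_NAME,
       kcu.REFERENCED_TABLE_NAME, kcu.REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE kcu
JOIN information_schema.REFERENTIAL_CONSTRAINTS rc
  ON rc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
 AND rc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
WHERE kcu.TABLE_SCHEMA = 'your_database'
  AND kcu.TABLE_NAME = 'your_table';
Unlike SHOW CREATE TABLE, this gives you the relationships as a result set you can iterate over directly.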
Also, if there are combinations of joins you use frequently, create views or stored procedures based on those common operations.

Is it better to return one big query or a few smaller ones?

I'm using MySQL to store video game data. I have tables for titles, platforms, tags, badges, reviews, developers, publishers, etc...
When someone is viewing a game, is it best to have one query that returns all the data associated with that game, or is it better to use several queries? Intuitively, since we have reviews, it seems pointless to include them in the same query since they'll need to be paginated. But there are other situations where I'm unsure whether to break the query down or use two queries...
I'm a bit worried about performance since I'm now joining the following tables to games: developers, publishers, metatags, badges, titles, genres, subgenres, classifications... To grab game badges (from games_badges; many-to-many to the games table, and many-to-many to the badges table) I can either do another join, or run a separate query... and I'm unsure which is best.
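For the badges case, the two options I'm weighing look roughly like this (the join column names are illustrative):
-- option A: one more join in the main game query
SELECT g.*, b.*
FROM games g
LEFT JOIN games_badges gb ON gb.game_id = g.id
LEFT JOIN badges b ON b.id = gb.badge_id
WHERE g.id = 123;
-- option B: load the game first, then a separate query just for its badges
SELECT b.*
FROM badges b
JOIN games_badges gb ON gb.badge_id = b.id
WHERE gb.game_id = 123;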
It is significantly faster to use one query than multiple queries, because starting a query and calculating the query plan are themselves costly, and running multiple queries in a row slows the server further each time. Obviously you should only get the data that you actually need, but fewer queries is always better.
So if you are going to show 20 games on a page, you can speed up the query (still using only one query) with a LIMIT clause and only run that query again later when they get to the next page. That or you can just make them wait for the query to complete and have all of the data there at once. One big wait or several little waits.
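For example (column names assumed):
SELECT id, title FROM games ORDER BY title LIMIT 0, 20;  -- page 1
SELECT id, title FROM games ORDER BY title LIMIT 20, 20; -- page 2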
tl;dr use as few queries as possible.
There is no panacea.
Always try to get only necessary data.
There is no universal answer to whether one big query or several small queries is better. Each case is unique; to answer this question you should profile your application and examine the queries' EXPLAIN output.
This is generally a processing problem.
If making one query would imply retrieving thousands of entries, call several queries to have MySQL do the processing (sums, etc.).
If making multiple queries involves making tens or hundreds of them, then call a single query.
Obviously you're always facing both of these (neither is a clear go-to option if you're asking the question), so the choices really are:
Pick the one you can take the hit on
Cache or mitigate it as much as you can so that you take a hit very rarely
Try to insert preprocessed data in the database to help you process the current data
Do the processing as part of a cron and have the application only retrieve the data
Take a few steps back and explore other possible approaches that don't require the processing

Database structure - To join or not to join

We're drawing up the database structure with the help of MySQL Workbench for a new app, and the number of joins required to produce a listing of the data is increasing drastically as the number of many-to-many relationships grows.
The application will be quite read-heavy and have a couple of hundred thousand rows per table.
The questions:
Is it really that bad to merge tables where needed and thereby reduce joins?
Should we start looking at horizontal partitioning? (in conjunction with merging tables)
Is there a better way than pivot tables to take care of many-to-many relationships? (A standard junction-table shape is sketched below.)
We discussed instead storing all data in serialized text columns and having the application do the sorting instead of the database, but this seems like a very bad idea, even though the database will be heavily cached. What do you think?
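For reference, this is the kind of pivot/junction table I mean; all the names are generic examples rather than our real schema:
CREATE TABLE product_tag (
  product_id INT NOT NULL,
  tag_id INT NOT NULL,
  PRIMARY KEY (product_id, tag_id),
  KEY idx_tag_product (tag_id, product_id),
  FOREIGN KEY (product_id) REFERENCES product (id),
  FOREIGN KEY (tag_id) REFERENCES tag (id)
) ENGINE=InnoDB;
The composite primary key covers lookups by product, and the secondary index covers lookups in the other direction, which matters for a read-heavy app.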
Go with the normalized form of the database. For most tasks you won't need more than 3 or 4 joins, and you can still write views for the most common ones. Denormalization forces you to always think about updating fields in multiple places/tables when changing one property, and will surely lead to more problems than benefits.
If you're worried about reporting performance, you can still extract the data in timed batches into separate tables to get the desired performance for your reporting queries. If it's for query simplicity, you can use views.
In reverse order:
Forget it. Use the database. People saying "do it in the application" are pretty often ignorant of the amount of work that goes into writing databases.
Depends on exact need.
Depends on exact need. OLTP (transaction processing) - go for fifth normal form. OLAP (analytical processing) - go for a proper star schema and denormalize to get optimal performance. Mixed - forget it; it does not work for larger installs because the theories are different... unless you make the database OLTP and then use a special OLAP cube database (which MySQL does not have).
Databases are designed to handle lots of joins. Use this feature as it will make many kinds of data manipulation in the database much easier. Otherwise, why not just use a flat file?
As always, it depends on your application, but in general, too much denormalisation can come back and bite you later on. A well normalised database means that you should be able to query your data in most ways that you may need later on, particularly for reporting (which often is an afterthought).
If you stick all your data in serialized text columns and your client asks for a report showing all rows that have a particular attribute, then you're going to have to do a bunch of string manipulation to get this data out.
If you're worried about too many joins for your queries, you could consider exposing certain sets of the data as a view...
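For example, a view can hide a frequently used set of joins behind a single name (all names here are hypothetical):
CREATE VIEW order_overview AS
SELECT o.id AS order_id, o.ordered_at, c.name AS customer_name, s.label AS status
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN statuses s ON s.id = o.status_id;
Queries then read from order_overview as if it were a single table, while the underlying data stays normalised.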
If you make sure to index the foreign keys (you did set up foreign keys didn't you?) and have proper where clauses in your queries, 10-15 joins should be easily handled by a database. Especially with so few rows. I have queries with that many joins on tables with millions of rows and they run fine.
Usually it is better to partition data than to denormalize.
As far as denormalizing goes, don't do it unless you also institute a strategy for keeping the denormalized data in sync with the parent table.
As to whether you really need that many tables or if your design is bad, well the only way we could comment on that is if we saw the table structure.
Unless you have clear evidence that performance is suffering because of the joins, stay normalised. Otherwise, as others have said, you'll have to worry about multiple updates.
Especially if the database is heavily cached, as you say, you'll be surprised how quick the DBMS is at doing this kind of thing - it is what it's designed for, after all.
Unless it's the sort of monster application, with huge amounts of data, that demands special performance optimisations, you'll find that keeping down the development, testing, and later, maintenance effort, will be much more important.
Joins are good, usually, not bad. They allow you to keep the data where it should be, which gives you maximum flexibility.
And as has been said many times, premature optimisation is usually bad, not good.