LINQ To Entities and Lazy Loading - linq-to-sql

In a controversial blog post today, Hackification pontificates on what appears to be a bug in the new LINQ To Entities framework:
Suppose I search for a customer:
var alice = data.Customers.First( c => c.Name == "Alice" );
Fine, that works nicely. Now let's see if I can find one of her orders:
var order = ( from o in alice.Orders
              where o.Item == "Item_Name"
              select o ).FirstOrDefault();
LINQ-to-SQL will find the child row. LINQ-to-Entities will silently return nothing.
Now let's suppose I iterate through all orders in the database:
foreach( var order in data.Orders ) {
    Console.WriteLine( "Order: " + order.Item );
}
And now repeat my search:
var order = ( from o in alice.Orders
              where o.Item == "Item_Name"
              select o ).FirstOrDefault();
Wow! LINQ-to-Entities is suddenly telling me the child object exists, despite telling me earlier that it didn't!
My initial reaction was that this had to be a bug, but after further consideration (and confirmation from the ADO.NET team), I realized that this behavior is caused by the Entity Framework not lazy-loading Alice's Orders collection when Alice is pulled from the data context.
This is because order is a LINQ-to-Objects query:
var order = ( from o in alice.Orders
              where o.Item == "Item_Name"
              select o ).FirstOrDefault();
and is not accessing the data context in any way, while his foreach loop:
foreach( var order in data.Orders )
is accessing the data context.
LINQ to SQL actually creates lazy-loaded properties for Orders, so that accessing them performs another query; LINQ to Entities leaves it up to you to retrieve related data manually.
Now, I'm not a big fan of ORMs, and this is precisely the reason. I've found that in order to have all the data you want ready at your fingertips, they repeatedly execute queries behind your back; for example, that LINQ-to-SQL query above might run an additional query per row of Customers to get Orders.
However, the EF not doing this seems to seriously violate the principle of least surprise. While it is a technically correct way to do things (you should run a second query to retrieve the orders, or retrieve everything from a view), it does not behave the way you would expect an ORM to.
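To make that "second query" concrete, here is a minimal sketch that goes back through the context instead of the in-memory alice.Orders collection, so LINQ to Entities actually translates it to SQL (the Customer navigation property name is an assumption):
// Query the context, not the already-materialized alice.Orders collection,
// so a database query actually runs.
var order = ( from o in data.Orders
              where o.Customer.Name == "Alice" && o.Item == "Item_Name"
              select o ).FirstOrDefault();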
So, is this good framework design? Or is Microsoft over thinking this for us?

Jon,
I've been playing with LINQ to Entities as well. It's got a long way to go before it catches up with LINQ to SQL. I've had to use LINQ to Entities for the Table-per-Type inheritance stuff. I found a good article recently which explains the whole "one company, two different ORM technologies" thing here.
However, you can do lazy loading, in a way, by doing this:
// Lazy Load Orders
var alice2 = data.Customers.First(c => c.Name == "Alice");
// Should Load the Orders
if (!alice2.Orders.IsLoaded)
    alice2.Orders.Load();
or you could just include the Orders in the original query:
// Include Orders in original query
var alice = data.Customers.Include("Orders").First(c => c.Name == "Alice");
// Should already be loaded
if (!alice.Orders.IsLoaded)
    alice.Orders.Load();
Hope it helps.
Dave

So, is this good framework design? Or is Microsoft over thinking this for us?
Well, let's analyse that: all the thinking that Microsoft does so we don't have to really makes us lazier programmers. But in general it does make us more productive (for the most part). So are they overthinking, or are they just thinking for us?

If LINQ to SQL and LINQ to Entities came from two different companies, it would be an acceptable difference - there's no law stating that all LINQ-to-Whatevers have to be implemented the same way.
However, they both come from Microsoft - and we shouldn't need intimate knowledge of their internal development teams and processes to know how to use two different things that, on their face, look exactly the same.
ORMs have their place, and do indeed fill a gap for people trying to get things done, but the ORM user must know exactly how their ORM gets things done - treating it like an impenetrable black box will only lead you to trouble.

Having lost a few days to this very problem, I sympathize.
The "fault," if there is one, is that there's a reasonable tendency to expect that a layer of abstraction is going to insulate from these kinds of problems. Going from LINQ, to Entities, to the database layer, doubly so.
Having to switch from MS-SQL (using LinqToSQL) to MySQL (using LinqToEntities), for instance, one would figure that the LINQ, at least, would be the same, if only to save the cost of having to rewrite program logic.
Having to litter code with .Load() and/or LINQ with .Include() simply because the persistence mechanism under the hood changed seems slightly disturbing, especially with a silent failure. The LINQ layer ought to at least behave consistently.
A number of ORM frameworks use a proxy object to dynamically load the lazy object transparently, rather than just return null, though I would have been happy with a collection-not-loaded exception.
I tend not to buy into the they-did-it-deliberately-for-your-benefit excuse; other ORM frameworks let you annotate whether you want eager or lazy-loading as needed. The same could be done here.
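As a rough illustration of the proxy idea from above, here is a minimal hand-rolled sketch, not taken from either framework: a collection that runs a loader delegate the first time it is enumerated instead of silently returning nothing (all names here are hypothetical).
using System;
using System.Collections;
using System.Collections.Generic;

public class LazyCollection<T> : IEnumerable<T>
{
    private readonly Func<IEnumerable<T>> loader;
    private List<T> items;

    public LazyCollection(Func<IEnumerable<T>> loader) { this.loader = loader; }

    public bool IsLoaded { get { return items != null; } }

    public IEnumerator<T> GetEnumerator()
    {
        if (items == null)
            items = new List<T>(loader());   // transparent load on first access
        return items.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
// Usage (hypothetical): alice.Orders = new LazyCollection<Order>(() => LoadOrdersFor(alice.CustomerId));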

I don't know much about ORMs, but as a user of LinqToSql and LinqToEntities I would hope that when you try to query Orders for Alice it does the extra query for you when you make the linq query (as opposed to not querying anything or querying everything for every row).
It seems natural to expect
from o in alice.Orders where o.Item == "Item_Name" select o
to work, given that's one of the reasons people use ORMs in the first place (to simplify data access).
The more I read about LinqToEntities, the more I think LinqToSql fulfills most developers' needs adequately. I usually just need a one-to-one mapping of tables.

Even though you shouldn't have to know about Microsoft's internal development teams and processes, the fact of the matter is that these two technologies are two completely different beasts.
The design decision for LINQ to SQL was, for simplicity's sake, to implicitly lazy-load collections. The ADO.NET Entity Framework team didn't want to execute queries without the user knowing, so they designed the API to require explicit loading for the first release.
LINQ to SQL has been handed over to the ADO.NET team, so you may see a consolidation of APIs in the future, or LINQ to SQL getting folded into the Entity Framework, or you may see LINQ to SQL atrophy from neglect and eventually become deprecated.

Related

Is MongoDB good for handling SQL-type data?

I have a rather huge application storing data in MongoDB (Mongoose), despite the fact that my data is thoroughly relational and could be represented very well as tables with schemas. The specific issue is that I have a lot of relations between objects, so I need to perform very deep populations: 25+ per request in total.
One option is to rewrite the app for MySQL. However, there is a ton of code bound to MongoDB. The question is: if the number of relations between objects (by ObjectID) keeps growing, will it still be as efficient as MySQL, or should I dive into the code and move the app to MySQL completely?
In both cases I use an ORM: Mongoose now, Sequelize if I move.
Is Mongo really efficient at working with relations? I mean, SQL was designed to join related tables, so I assume it has some optimisations under the cover, whereas relations seem to be a somewhat unusual use case for Mongo. So I worry that a logically identical query, gathering data from 25 collections in Mongo versus joining 25 tables in MySQL, may be slower in Mongo.
Here's the example of Schema I'm using. Populated fields are marked with *.
Man
  - [friends_ids] --> [Man]*
      - friends_ids*: ...
      - pets_ids*: ...
      - ...
  - [pets_ids] -> [Pet]*
      - name
      - avatars*: [Avatar]
          - path
          - size
      - ...
My thoughts about relations: let's imagine a Man object whose [friends] field should be populated. Let's break it down.
MySQL ORM:
1. From the MANS table, find the Man where id = :id.
2. From the MAN-TO-MANS table, find all records where friend_id = the id of the Man from step 1.
3. From the MANS table, find all records whose id matches the ids of the Men from step 2.
4. Join it all into one Man object with the friends field populated.
Mongo:
1. From the MANS collection, find the Man where _id = :_id. Get its friends _ids array at this step (not populated).
2. From the MANS collection, find all documents whose _id is in the array from step 1.
3. Join it into one Man object with the friends field populated.
No requests to JOIN tables. Am I right?
So I need to perform very deep populations: 25+ per request in total.
A common misconception is that MongoDB does not support JOINs. That is only partially true: the reality is that MongoDB does not support server-side JOINs.
The MongoDB motto is client-side JOINing.
This motto can work against you; the application does not always understand the best way to JOIN, so you have to pick your schema, queries and JOINs very carefully in MongoDB to ensure that you are not querying inefficiently.
25+ is perfectly possible for MongoDB, that's not the problem. The problem will be what JOINs you are doing.
This leads onto:
Is Mongo really efficient in working with relations?
Let me give you an example of where MongoDB would actually be faster than MySQL.
Imagine you have a group collection with each group document containing a user_ids field which is represented as an array of ObjectIds which directly relate to the _id field in the user collection.
Doing two queries, one for the group and one for the users, would likely be faster than MySQL in this specific case, since MongoDB, for one, would not need to atomically write out a result set using your IO bandwidth for common tasks.
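To sketch that two-query, client-side join in code (shown here with the MongoDB .NET driver rather than Mongoose; the Group and User shapes and collection names are hypothetical):
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

var db = new MongoClient("mongodb://localhost").GetDatabase("app");
var groups = db.GetCollection<Group>("groups");
var users = db.GetCollection<User>("users");

// Round-trip 1: fetch the group document, including its array of user ObjectIds.
var group = groups.Find(g => g.Name == "admins").First();

// Round-trip 2: fetch all referenced users at once, then "join" them in the application.
var members = users.Find(Builders<User>.Filter.In(u => u.Id, group.UserIds)).ToList();

public class Group { public ObjectId Id { get; set; } public string Name { get; set; } public List<ObjectId> UserIds { get; set; } }
public class User  { public ObjectId Id { get; set; } public string Name { get; set; } }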
That being said, with anything complex you will get hammered by the fact that the application does not truly know how to use index intersection and merging to create a reasonably performant JOIN.
So, for example, say you wish to JOIN between 3 tables in one query, paginating by the third JOINed table. That would probably kill MongoDB's performance, while not being such an inefficient JOIN to perform.
However, you might also find that those JOINs are not scalable anyway and are in fact killing any performance you get on MySQL.
if the number of relations between objects (by ObjectID) keeps growing, will it still be as efficient as MySQL, or should I dive into the code and move the app to MySQL completely?
That depends on the queries, but I have at least given you some pointers.
Your question is a bit broad, but I interpret it in one of two ways.
One, you are saying that you have references 25 levels deep, and in that case using populate is just not going to work. I dearly hope this is not the pickle you find yourself in. Moving to SQL won't help you either; the fact is you'll be going back to the database too many times no matter what. But if this is how it's got to be, you can tackle it using a variation of the materialized path pattern, which will allow you to select subtrees much more efficiently within your very deep data tree. See here for a discussion: http://docs.mongodb.org/manual/tutorial/model-tree-structures-with-materialized-paths/
The other interpretation is that you have 25 relations between collections. Let's say in this case there is one collection in Mongo for every letter of the English alphabet, and documents in collection A have references to one or more documents in each of collections B-Z. In this case, you might be OK. Mongoose populate lets you populate multiple reference paths, and if there is a limit, I doubt it is anywhere near as low as 25. So you'd do something like docA.populate("B C ... Z"). In this case too, moving to SQL won't help you per se; you'll still be required to join on multiple tables.
Of course, your original statement that this could all be done in SQL is valid; there doesn't seem to have been a specific reason to use (or not use) Mongo here, it just seems to be the way things were done. However, it also seems that whether you use a NoSQL or SQL approach here isn't the determining factor in whether you will see inefficiency. Rather, it's whether you model the data correctly within whatever solution you choose.

Complex filtering in a Rails app. Not sure complex SQL is the answer?

I have an application that allows users to filter applicants based on a very large set of criteria. The criteria are each represented by boolean columns spanning multiple tables in the database. Instead of using ActiveRecord models, I thought it was best to use pure SQL and put the bulk of the work in the database. In order to do this I have to construct a rather complex SQL query based on the criteria that the users selected, and then run it through AR on the db. Is there a better way to do this? I want to maximize performance while also having maintainable and non-brittle code. Any help would be greatly appreciated.
As #hazzit said, it is difficult to answer without more details, but here are my two cents on this. Raw SQL is usually needed to perform complex operations like aggregates, calculations, etc. However, when it comes to search / filtering features, I often find using raw SQL overkill and not quite maintainable.
The key question here is: can you break down your problem into multiple independent filters?
If the answer is yes, then you should leverage the power of ActiveRecord and Arel. I often find myself implementing something like this in my model:
scope :a_scope, ->{ where something: true }
scope :another_scope, ->( option ){ where an_option: option }
scope :using_arel, ->{ joins(:assoc).where Assoc.arel_table[:some_field].not_eq "foo" }
# cue a bunch of scopes

def self.search( options = {} )
  relation = all
  relation = relation.a_scope if options[:an_option]
  relation = relation.another_scope( options[:another_option] ) unless options[:flag]
  # add logic as you need it
  relation
end
The beauty of this solution is that you declare a clean interface into which you can directly pour all the params from your checkboxes and fields, and that returns a relation. Breaking the query into multiple, reusable scopes helps keep the thing readable and maintainable; using a search class method ties it all together and allows thorough documentation... And all in all, using Arel helps secure the app against injections.
As a side note, this does not prevent you from using raw SQL, as long as the query can be isolated inside a scope.
If this method is not suitable to your needs, there's another option: use a full-fledged search / filtering solution like Sunspot. This uses another store, separate from your db, that indexes defined parts of your data for easy and performant search.
It is hard to answer this question fully without knowing more details, but I'll try anyway.
While databases are bad at quite a few things, they are very good at filtering data, especially when it comes to high volumes.
If you do the filtering in Ruby on Rails (or just about any other programming language), the system will have to retrieve all of the unfiltered data from the database, which will cause tons of disk I/O and network (or interprocess) traffic. It then has to go through all those unfiltered results in memory, which may be quite a burden on RAM and CPU.
If you do the filtering in the database, there is a pretty good chance that most of the records will never actually be retrieved from disk, handed over to RoR, or filtered in memory. The main reason indexes even exist is to avoid exactly this kind of expensive operation and speed things up. (Yes, they also help maintain data integrity.)
To make this work, however, you may need to help the database a bit to do its job efficiently. You will have to create indexes matching your filtering criteria, and you may have to look into performance issues with certain types of queries (how to avoid temporary tables and such). However, it is definitely worth it.
That said, there actually are a few types of queries that a given database is not good at doing. Those are few and far between, but they do exist. In those cases, an implementation in RoR might be the better way to go. Even without knowing more about your scenario, I'd say it's a pretty safe bet that your queries are not among those.

Using DataContext cache to preload child collection

In an attempt to reduce round-trips to the database, I was hoping to 'preload' child collections for a parent object. My hope was that if I loaded the objects that make up the child collection into the DataContext cache, Linq2SQL would use those objects instead of going to the database.
For example, assume I have a Person object with two child collections: Children and Cars.
I thought this might work:
var children = from p in dbc.Person
               select p.Children;
var cars = from p in dbc.Person
           select p.Cars;
var people = from p in dbc.Person
             select p;

var dummy1 = children.ToList();
var dummy2 = cars.ToList();

foreach(var person in people){
    Debug.WriteLine(person.Children.Count);
}
But instead, I'm still getting one trip to the database for every call to person.Children.Count, even though all the children are already loaded in the DataContext.
Ultimately, what I'm looking for is a way to load a full object graph (with multiple child collections) in as few trips to the database as possible.
I'm aware of the DataLoadOptions class (and LoadWith), but it's limited to one child collection, and that's a problem for me.
I wouldn't mind using a stored procedure if I was still able to build up the full object graph (and still have a Linq2SQL object, of course) without a lot of extra manipulation of the objects.
I don't think what you require is directly possible with LINQ to SQL, or even SQL, in the way that you're expecting.
When you consider how SQL works, a single one-to-many relationship can easily be flattened by an inner/left join. After this, if you want another set of objects included, you could theoretically write another left join to bring back all the rows in the other table, but imagine the result set SQL would produce; it is not easily workable back into an object graph. ORMs will typically fall back to querying per row to provide this data. Write multi-level DataLoadOptions with LINQ to SQL and you will start to see this.
Also, the impact on performance with many LEFT JOINs will easily outweigh the perceived benefits of single querying.
Consider your example. You are fetching ALL of the rows back from these two tables. This poses two problems further down the line:
1. The tables may end up holding much more data than you expect. Pulling back thousands of rows from SQL all at once may not be good for performance, and you are then expecting LINQ to SQL to search those lists for matching objects. Depending on usage, I imagine SQL could provide this data faster.
2. The behavior you expect is hidden, and to me it would be unusual for a SELECT on a large table to potentially bloat the app's memory usage by caching all of the rows. Maybe they could include it as an option, or you could provide your own extension methods.
Solution
I would personally start to look at caching the data outside of LINQ to SQL, and then retrieving objects from the ASP.NET cache / memory cache / your cache provider, if the full object graph is deemed expensive to create.
One level of DataLoadOptions plus relying on the built-in entity relationships + some manual caching will probably save many a headache.
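For reference, here is a minimal sketch of that single level of DataLoadOptions (System.Data.Linq), reusing the dbc / Person names from the question; note the options must be assigned before the first query runs on the context:
var options = new DataLoadOptions();
options.LoadWith<Person>(p => p.Children);
options.LoadWith<Person>(p => p.Cars);
dbc.LoadOptions = options;

// One of the one-to-many collections comes back in the same joined query;
// the other is typically still fetched with additional queries under the covers.
var people = dbc.Person.ToList();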
I have come across this particular problem a few times and haven't thought of anything else yet.

Abstracted JOIN for maintainability?

Does anyone know an ORM that can abstract JOINs? I'm using PHP, but I would take ideas from anywhere. I've used Doctrine ORM, but I'm not sure if it supports this concept.
I would like to be able to specify a relation that is actually a complicated query, and then use that relation in other queries. Mostly this is for maintainability, so I don't have a lot of replicated code that has to change if my schema changes. Is this even possible in theory (at least for some subset of "complicated query")?
Here's an example of what I'm talking about:
ORM.defineRelationship('Message->Unresponded', '
LEFT JOIN Message_Response
ON Message.id = Message_Response.Message_id
LEFT JOIN Message AS Response
ON Message_Response.Response_id = Response.id
WHERE Response.id IS NULL
');
ORM.query('
SELECT * FROM Message
SUPER_JOIN Unresponded
');
Sorry for the purely invented syntax. I don't know if anything like this exists. It would certainly be complicated if it did.
One possibility would be to write this join as a view in the database. Then you can use any query tools on the view.
Microsoft's Entity Framework also supports very complex mappings between code entities and the database tables, even crossing databases. The query you've given as an example would be easily supported in terms of mapping from that join of tables to an entity. You can then execute further queries against the resulting joined data using LINQ. Of course, if you're using PHP this may not be a huge amount of use to you.
However I'm not aware of a product that wraps up the join into the syntax of further queries in the way you've shown.
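For instance, if the join from the question were created as a database view and mapped to an entity, further queries compose on top of it in LINQ; the context and entity names here are hypothetical:
var stale = ( from m in context.UnrespondedMessages
              where m.CreatedOn < DateTime.Today.AddDays(-7)
              select m ).ToList();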

Challenges with the LINQ to SQL concept in .NET

Let's say I use LINQ to SQL to interact with the database from C#; what challenges might I face, in terms of architecture, performance, type safety, object orientation, etc.?
Basically Linq to SQL generates a class for each table in your database, complete with relation properties and all, so you will have no problems with type safety. The use of C# partials allows you to add functionality to these objects without messing around with Linq to SQLs autogenerated code. It works pretty well.
As tables map directly to classes and objects, you will either have to accept that your domain layer mirrors the database design directly, or you will have to build some form of abstraction layer above Linq to SQL. The direct mirroring of tables can be especially troublesome with many-to-many relations, which is not directly supported - instead of Orders.Products you get Order.OrderDetails.SelectMany(od => od.Product).
Unlike most other ORMs, Linq to SQL does not just dispense objects from the database and allow you to store or update objects by passing them back into the ORM. Instead, Linq to SQL tracks the state of objects loaded from the database, and allows you to change the saved state. It is difficult to explain and strange to understand - I recommend you read some of Rick Strahl's blog posts on the subject.
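A minimal sketch of what that change tracking looks like in practice; dc and the Customer property names follow the usual generated-model pattern and are assumptions:
// Update: modify a tracked object, and the context works out the delta.
var cust = dc.Customers.Single(c => c.Name == "Alice");
cust.City = "Oslo";

// Insert: new objects must be handed to the context explicitly.
dc.Customers.InsertOnSubmit(new Customer { Name = "Bob" });

dc.SubmitChanges();   // one call persists both tracked changes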
Performance-wise, Linq to SQL does pretty well. In benchmarking tests it shows speeds of about 90-95% of what a native SQL reader would provide, and in my experience real-world usage is also pretty fast. Like all ORMs, Linq to SQL is affected by the N+1 selects problem, but it provides good ways to specify lazy/eager loading depending on context.
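For example, deferred (lazy) loading can be switched off per context so that nothing is fetched behind your back, with eager loads then declared up front (a sketch, assuming the same Customer/Orders model used earlier in this post):
dc.DeferredLoadingEnabled = false;        // navigation properties no longer trigger hidden queries
var opts = new DataLoadOptions();
opts.LoadWith<Customer>(c => c.Orders);   // declare the eager load explicitly instead
dc.LoadOptions = opts;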
Also, by choosing Linq to SQL you choose MSSQL - there do exist third party solutions that allow you to connect to other databases, but last time I checked, none of them appeared very complete.
All in all, Linq to SQL is a good and somewhat easy to learn ORM, which performs okay. If you need features beyond what Linq to SQL is offering, take a look at the new entity framework - it has more features, but is also more complex.
We've had a few challenges, mainly from opening the query construction capability to programmers that don't understand how databases work. Here are a few smells:
// Bad scaling:
// query in a loop - causes n round-trips
// when c round-trips could have been performed.
List<OrderDetail> od = new List<OrderDetail>();
foreach (Customer cust in customers)
{
    foreach (Order o in cust.Orders)
    {
        od.AddRange(dc.OrderDetails.Where(x => x.OrderId == o.OrderId));
    }
}

// No separation of operations intended for execution in the database
// from operations intended to be executed locally.
var query =
    from c in dc.Customers
    where c.City.StartsWith(textBox1.Text)
    where DateTime.Parse(textBox2.Text) <= c.SignUpDate
    from o in c.Orders
    where o.OrderCode == (OrderCodes)Enum.Parse(typeof(OrderCodes), "Complete")
    select o;

// Not understanding when results are pulled into memory,
// causing a full table load.
List<Item> result = dc.Items.ToList().Skip(100).Take(20).ToList();
Another problem is that one more level of separation from the table structures means indexes are even easier to ignore (that's a problem with any ORM though).