Is LINQ lazy-evaluated? - linq-to-sql

Greetings, I have the following question. I did not find an exact answer for it, and it's really interesting to me. Suppose I have the following code that retrieves records from the database (in order to export them to an XML file, for example).
var result = from emps in dc.Employees
             where emps.age > 21
             select emps;

foreach (var emp in result) {
    // Append this record in a suitable format to the end of the XML file
}
Suppose there are a million records that satisfy the where condition in the code. What will happen? Will all this data be retrieved from SQL Server into runtime memory as soon as execution reaches the foreach construct, or will it be retrieved as necessary, first one record, then the next? In other words, does LINQ really handle iterating through large collections (see my post here for details)?
If not, how do I overcome the memory issues in that case? If I really need to traverse a large collection, what should I do? Calculate the actual number of elements in the collection with the Count function, and after that read the data from the database in small portions? Is there an easy way to implement paging with the LINQ framework?

All the data will be retrieved from SQL Server, in one go, and put into memory. The only way around this that I can think of is to process the data in smaller chunks (for example, by paging with Skip() and Take()). But, of course, this requires more hits to SQL Server.
Here is a Linq paging extension method I wrote to do this:
public static IEnumerable<TSource> Page<TSource>(this IEnumerable<TSource> source, int pageNumber, int pageSize)
{
    return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
}
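Note that when the source is a LINQ-to-SQL query, an IQueryable<TSource> overload lets Skip/Take be composed into the generated SQL rather than applied in memory. A minimal sketch of such an overload (my addition, not part of the original answer):
public static IQueryable<TSource> Page<TSource>(this IQueryable<TSource> source, int pageNumber, int pageSize)
{
    // Skip/Take on IQueryable are translated by the provider (a ROW_NUMBER-based
    // query in LINQ to SQL), so only one page of rows crosses the wire
    return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
}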

Yes, LINQ uses lazy evaluation. The database would be queried when the foreach starts to execute, but it would fetch all the data in one go (it would be much less efficient to do millions of queries for just one result at a time).
If you're worried about bringing in too many results in one go, you can use Skip and Take to get only a limited number of results at a time (thus paginating your result).

It'll be retrieved when you invoke ToList or similar methods. LINQ has deferred execution:
http://weblogs.asp.net/psteele/archive/2008/04/18/linq-deferred-execution.aspx
Whether deferred execution ends up loading the entire collection from the data source (in the case of an OR/M or any other LINQ provider) is determined by the implementer of the LINQ object source.
For example, some OR/Ms provide lazy loading, which means your "entire list of customers" behaves like something of a cursor: accessing one of the items (an employee), or even a single property of it, loads only that employee or only the accessed property.
But, anyway, these are the basics.
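As a tiny illustration of deferred execution with plain LINQ to Objects (my own example, not from the linked article): the query below is only a description of work, and nothing runs until it is enumerated.
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2 };
        var query = numbers.Where(n => n > 1); // no work happens here

        numbers.Add(3); // mutate the source after the query is defined

        // Enumeration happens now, so the query sees the element added above
        Console.WriteLine(query.Count()); // prints 2 (the items 2 and 3)
    }
}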
EDIT: Now I see it's a LINQ-to-SQL thing... Or perhaps the question's author misunderstood LINQ and doesn't know that LINQ isn't LINQ-to-SQL; it's more a pattern and a language feature.

OK, now thanks to this answer I have an idea: how about combining the page-taking function with the possibilities of yield return? Here is a sample of the code:
// This is the original function that takes the page
public static IEnumerable<TSource> Page<TSource>(this IEnumerable<TSource> source, int pageNumber, int pageSize)
{
    return source.Skip((pageNumber - 1) * pageSize).Take(pageSize);
}

// And here is the function with the yield implementation
public static IEnumerable<TSource> Lazy<TSource>(this IEnumerable<TSource> source, int pageSize)
{
    int pageNumber = 1;
    int count;
    do
    {
        // Materialize the current page once so it isn't enumerated twice
        var page = Page(source, pageNumber, pageSize).ToList();
        count = page.Count;
        pageNumber++;
        foreach (var item in page)
        {
            yield return item; // hand records back one at a time
        }
    } while (count > 0);
}
// And here goes our code for traversing the collection with paging and foreach
var result = from emps in dc.Employees
             where emps.age > 21
             select emps;

// Let's use a page size of 1000
foreach (var emp in result.Lazy(1000)) {
    // Append this record in a suitable format to the end of the XML file
}
I think this way we can overcome the memory issue, while keeping the foreach syntax uncomplicated.
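One caveat with this, though: Enumerable.Skip walks the sequence in memory, so when the source is a LINQ-to-SQL query, every page re-enumerates the query from the start. A sketch of the same idea written against IQueryable<TSource> (the name LazyPages is mine), so that each page becomes its own SQL query:
public static IEnumerable<TSource> LazyPages<TSource>(this IQueryable<TSource> source, int pageSize)
{
    int pageNumber = 1;
    while (true)
    {
        // Skip/Take compose into the SQL itself; the source should have a
        // deterministic OrderBy so that pages don't overlap between queries
        var page = source.Skip((pageNumber - 1) * pageSize)
                         .Take(pageSize)
                         .ToList();
        if (page.Count == 0)
            yield break;

        foreach (var item in page)
            yield return item;

        pageNumber++;
    }
}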

Related

Do Couchbase reactive clients guarantee order of rows in view query result

I use Couchbase Java SDK 2.2.6 with Couchbase server 4.1.
I query my view with the following code
public <T> List<T> findDocuments(ViewQuery query, String bucketAlias, Class<T> clazz) {
    // We specifically set reduce false and include docs to retrieve docs
    query.reduce(false).includeDocs();
    log.debug("Find all documents, query = {}", decode(query));
    return getBucket(bucketAlias)
            .query(query)
            .allRows()
            .stream()
            .map(row -> fromJsonDocument(row.document(), clazz))
            .collect(Collectors.toList());
}

private static <A> A fromJsonDocument(JsonDocument saved, Class<A> clazz) {
    log.debug("Retrieved json document -> {}", saved);
    A object = fromJson(saved.content(), clazz);
    return object;
}
In the logs from the fromJsonDocument method I see that rows are not always sorted by the row key. Usually they are, but sometimes they are not.
If I just run this query in the browser Couchbase GUI, I always receive results in the expected order. Is it a bug, or is it expected that view query results are not sorted when queried with the async client?
What is the behaviour in clients other than Java?
This is due to the asynchronous nature of your call in the Java client, plus the fact that you used includeDocs.
What includeDocs does is weave in a call to get for each document id received from the view. So when you look at the asynchronous sequence of AsyncViewRow with includeDocs, you're actually looking at a composition of a row returned by the view and an asynchronous retrieval of the whole document.
If one document retrieval has a little more latency than the one for the previous row, it can reorder the (row + doc) emissions.
But good news, everyone! There is an includeDocsOrdered alternative in the ViewQuery that takes exactly the same parameters as includeDocs but ensures that the AsyncViewRow items come in the same order returned by the view.
This is done by eagerly triggering the get retrievals but buffering those that arrive out of order, so as to maintain the original order without sacrificing too much performance.
That is quite specific to the Java client, with its usage of RxJava. I'm not even sure other clients have the notion of includeDocs...

How to map LINQ To SQL to enable eager loading, return EntitySet or ICollection?

This is related (but fairly independent) to my question here: Why SELECT N + 1 with no foreign keys and LINQ?
I've tried using DataLoadOptions to force eager loading, but I'm not getting it to work.
I'm manually writing my LinqToSQL mappings and was first following this tutorial: http://www.codeproject.com/Articles/43025/A-LINQ-Tutorial-Mapping-Tables-to-Objects
Now I've found this tutorial: http://msdn.microsoft.com/en-us/library/bb386950.aspx
There's at least one major difference that I can spot. The first tutorial suggests returning ICollection and the second EntitySet. Since I'm having issues, I tried to switch my code to return EntitySet, but then I ran into needing to reference System.Data.Linq in my Views and Controllers. I tried to do that, but didn't get it to work. I'm also not sure it's a good idea.
At this point, I just want to know which return type I'm supposed to use for a good design. Can I have a good design and still be able to force eager loading in specific cases?
A lot of trial and error finally led to the solution. It's fine to return ICollection or IList, or in some cases IEnumerable. Some think returning EntitySet or IQueryable is a bad idea, and I agree, because it exposes too much of the data source/technology. Some think returning IEnumerable is a bad idea, and it seems like it depends; the problem being that it can be used for lazy loading, which may or may not be a good thing.
One recurring issue is that of returning paged results along with a count of the total items outside the page. This can be solved by creating a CollectionPage<T> (http://www.codetunnel.com/blog/post/104/how-to-properly-return-a-paged-result-set-from-your-repository).
More on what to return from repositories here:
http://www.codetunnel.com/blog/post/103/should-you-return-iqueryablet-from-your-repositories
http://www.shawnmclean.com/blog/2011/06/iqueryable-vs-ienumerable-in-the-repository-pattern/
IEnumerable vs IQueryable for Business Logic or DAL return Types
List, IList, IEnumerable, IQueryable, ICollection, which is most flexible return type?
Even more important, DataLoadOptions can do the eager loading! I've now restructured my code so much that I'm not 100% sure what I did wrong to cause DataLoadOptions not to work. As far as I've gathered, I should get an exception if I try to add it to the DataContext after the DataContext has been used, which I didn't. What I've found helpful, though, is to think in terms of the Unit of Work pattern. However, for my needs (and because I don't want to return EntitySet or IQueryable from my repositories) I'm not going to implement a cross-repository Unit of Work. Instead I'm just thinking of each repository method as its own small Unit of Work. I'm sure there are downsides to this (for instance it might cause more round-trips to the database in some update scenarios), and in the future I might reconsider. However, it's a simple, clean solution.
More info here:
https://stackoverflow.com/a/7941017/1312533
http://www.asp.net/mvc/tutorials/getting-started-with-ef-using-mvc/implementing-the-repository-and-unit-of-work-patterns-in-an-asp-net-mvc-application
This is what I ended up with in my repository:
public class SqlLocalizedCategoriesRepository : ILocalizedCategoriesRepository
{
    private string connectionString;
    private HttpContextBase httpContext;

    // Injected with Inversion of Control
    public SqlLocalizedCategoriesRepository(string connectionString, HttpContextBase httpContext)
    {
        this.connectionString = connectionString;
        this.httpContext = httpContext;
    }

    public CollectionPage<Product> GetProductsByLocalizedCategory(string category, int countryId, int page, int pageSize)
    {
        // Set up a DataContext. Because DataContext implements IDisposable, it should be disposed of.
        using (var context = new DataContext(connectionString))
        {
            // In this case I want all ProductSubs for the Products, so I eager load them with
            // LoadWith. There's also AssociateWith, which can filter what is eager loaded.
            var dlo = new System.Data.Linq.DataLoadOptions();
            dlo.LoadWith<Product>(p => p.ProductSubs);
            context.LoadOptions = dlo;

            // For logging queries; a must, so you can see what LINQ to SQL generates
            context.Log = (StringWriter)httpContext.Items["linqToSqlLog"];

            // Query the DataContext. This gets the Category into memory. There might be a way to
            // avoid that by combining it with the next query, but in my case the next step needs
            // the Category anyway, so it's not worth doing; I'm going to restructure this code to
            // take a categoryId parameter instead of the category parameter.
            var cat = (from lc in context.GetTable<LocalizedCategory>()
                       where lc.CountryID == countryId && lc.Name == category
                       select lc.Category).First();

            // Generates a single query to get the relevant products, which with DataLoadOptions
            // also loads the related ProductSubs. It's important that this stays a query and is
            // not loaded into memory, since we're going to split it into pages.
            var products = (from p in context.GetTable<Product>()
                            where p.ProductCategories.Any(pm => pm.Category.CategoryID == cat.CategoryID)
                            select p);

            // Return the results
            var pageOfItems = new CollectionPage<Product>
            {
                // Gets one page of products into memory
                Items = products.Skip(pageSize * (page - 1)).Take(pageSize).ToList(),
                // Gets the total count of items belonging to the Category
                TotalItems = products.Count(),
                CurrentPage = page
            };
            return pageOfItems;
        }
    }
}
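For reference, the post never shows CollectionPage<T>; a minimal sketch of the shape assumed above (modelled on the linked codetunnel.com article's paged-result idea):
public class CollectionPage<T>
{
    public IList<T> Items { get; set; }  // the items on the current page
    public int TotalItems { get; set; }  // total item count across all pages
    public int CurrentPage { get; set; } // the page the items belong to
}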

Entity Framework Code First Case Sensitivity on string PK/FK Relationships

I have a fairly simple composite one-to-many relationship defined using POCO/Fluent API, one column of which is a string.
I've discovered that the data in this column in our database is inconsistent in terms of case, i.e. 'abb', 'ABB'. This is our main ERP system and is fed by a variety of sources which are mainly beyond our control.
This is leading to problems using EF Code First when joining to related tables, as the join is silently ignored by EF when the case of the PK/FK differs, even though SQL Profiler shows the correct SQL being executed and results returned.
I'm using WCF, so I have lazy loading and proxy creation turned off, and I am eager loading required related entities using Include, e.g.
var member = context.Member.Include(m => m.Audits).First(m => m.Id == id);
Are there any solutions to this outside of amending the database schema?
EF Insensitive join comparison
Hi, I'm having the same problem (although not with Code First, but with a generated model).
The cause is that EF makes a case-sensitive comparison of the key fields, and it doesn't find the related objects.
I'm guessing the problem lies in the "EDM Relationship Manager", and maybe there's a possibility of overriding this behavior.
I found a simple workaround for this: lower-casing the related properties:
[EdmScalarPropertyAttribute(EntityKeyProperty=true, IsNullable=false)]
[DataMemberAttribute()]
public global::System.String id
{
    get
    {
        return _id.ToLower(); // <- here
    }
    set
    {
        if (_id != value)
        {
            OnidChanging(value);
            ReportPropertyChanging("id");
            _id = StructuralObject.SetValidValue(value, false);
            ReportPropertyChanged("id");
            OnidChanged();
        }
    }
}
private global::System.String _id;
partial void OnidChanging(global::System.String value);
partial void OnidChanged();
It actually works, but, of course, it's a lame workaround.
I'm sticking to it for a while until I (or somebody) come up with a better solution.
Good luck!
I came up with a workaround that manually "stitches up" the association after the context has retrieved the appropriate rows from the database. Translated to your problem it would be along these lines:
// Your original query
var member = context.Member.Include(m => m.Audits).First(m => m.Id == id);

// The "stitch up" code, which should probably be moved to a method of the data context
var membersWithoutAudits = context.Members.Local.Where(m => !m.Audits.Any()).ToList();
foreach (var nextMember in membersWithoutAudits)
{
    // Now we can populate the association using whatever logic we like
    nextMember.Audits = context.Audits.Local.Where(a => a.MemberId.ToLower() == nextMember.Id.ToLower()).ToList();
}
Notice how we use the context.[DbSet].Local property to ensure that we do all the "stitch up" in memory, without making any further SQL calls. I also fetch the members without audits as a performance optimization, so we are not redoing the work of EF's association fix-up (in the cases where it did work). But you could just as easily remap every "member" instance.

How do you implement Pipes and Filters pattern with LinqToSQL/Entity Framework/NHibernate?

While building my DAL repository, I stumbled upon a concept called Pipes and Filters. I read about it here, here and saw a screencast from here. I am still not sure how to go about implementing this pattern. Theoretically it all sounds good, but how do we really implement it in an enterprise scenario?
I will appreciate any resources, tips, examples or explanations for this pattern in the context of the data mappers/ORMs mentioned in the question.
Thanks in advance!!
Ultimately, LINQ on IEnumerable<T> is a pipes and filters implementation. IEnumerable<T> is a streaming API, meaning that data is lazily returned as you ask for it (via iterator blocks), rather than everything being loaded at once and returned as one big buffer of records.
This means that your query:
var qry = from row in source // IEnumerable<T>
          where row.Foo == "abc"
          select new { row.ID, row.Name };
is:
var qry = source.Where(row => row.Foo == "abc")
                .Select(row => new { row.ID, row.Name });
As you enumerate over this, it will consume the data lazily. You can see this graphically with Jon Skeet's Visual LINQ. The only things that break the pipe are things that force buffering: OrderBy, GroupBy, etc. For high-volume work, Jon and I worked on Push LINQ for doing aggregates without buffering in such scenarios.
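A small illustration of that streaming/buffering distinction (my own example): Where passes elements through one at a time, while OrderBy must consume the whole source before it can yield anything.
using System;
using System.Collections.Generic;
using System.Linq;

class PipeDemo
{
    static IEnumerable<int> Source()
    {
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine("yielding " + i);
            yield return i;
        }
    }

    static void Main()
    {
        // Streams: prints only "yielding 0", because Where/First pull lazily
        var first = Source().Where(x => x % 2 == 0).First();

        // Buffers: prints all five "yielding" lines before returning, because
        // OrderBy has to see every element before it can emit the first one
        var largest = Source().OrderBy(x => -x).First();
    }
}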
IQueryable<T> (exposed by most ORM tools - LINQ-to-SQL, Entity Framework, LINQ-to-NHibernate) is a slightly different beast; because the database engine is going to do most of the heavy lifting, the chances are that most of the steps are already done - all that is left is to consume an IDataReader and project this to objects/values - but that is still typically a pipe (IQueryable<T> implements IEnumerable<T>) unless you call .ToArray(), .ToList() etc.
With regard to use in the enterprise... my view is that it is fine to use IQueryable<T> to write composable queries inside the repository, but they shouldn't leave the repository, as that would make the internal operation of the repository subject to the caller, and you would be unable to properly unit test / profile / optimize / etc. I've taken to doing clever things in the repository, but returning lists/arrays. This also means my repository stays unaware of the implementation.
This is a shame, as the temptation to "return" IQueryable<T> from a repository method is quite large; for example, that would allow the caller to add paging/filters/etc. But remember that they haven't actually consumed the data at that point, which makes resource management a pain. Also, in MVC etc. you'd need to ensure that the controller calls .ToList() or similar, so that it isn't the view that is controlling data access (otherwise, again, you can't unit test the controller properly).
A safe (IMO) use of filters in the DAL would be things like:
public Customer[] List(string name, string countryCode)
{
    using (var ctx = new CustomerDataContext())
    {
        IQueryable<Customer> qry = ctx.Customers.Where(x => x.IsOpen);
        if (!string.IsNullOrEmpty(name))
        {
            qry = qry.Where(cust => cust.Name.Contains(name));
        }
        if (!string.IsNullOrEmpty(countryCode))
        {
            qry = qry.Where(cust => cust.CountryCode == countryCode);
        }
        return qry.ToArray();
    }
}
Here we've added filters on-the-fly, but nothing happens until we call ToArray. At this point, the data is obtained and returned (disposing the data-context in the process). This can be fully unit tested. If we did something similar but just returned IQueryable<T>, the caller might do something like:
var custs = customerRepository.GetCustomers()
                              .Where(x => SomeUnmappedFunction(x));
And all of a sudden our DAL starts failing (cannot translate SomeUnmappedFunction to TSQL, etc). You can still do a lot of interesting things in the repository, though.
The only pain point here is that it might push you to have a few overloads to support different calling patterns (with/without paging, etc.). Until optional/named parameters arrive, I find the best answer here is to use extension methods on the interface; that way, I only need one concrete repository implementation:
class CustomerRepository : ICustomerRepository
{
    public Customer[] List(
        string name, string countryCode,
        int? pageSize, int? pageNumber) {...}
}

interface ICustomerRepository
{
    Customer[] List(
        string name, string countryCode,
        int? pageSize, int? pageNumber);
}

static class CustomerRepositoryExtensions
{
    public static Customer[] List(
        this ICustomerRepository repo,
        string name, string countryCode)
    {
        return repo.List(name, countryCode, null, null);
    }
}
Now we have virtual overloads (as extension methods) on ICustomerRepository - so our caller can use repo.List("abc","def") without having to specify the paging.
Finally - without LINQ, using pipes and filters becomes a lot more painful. You'll be writing some kind of text based query (TSQL, ESQL, HQL). You can obviously append strings, but it isn't very "pipe/filter"-ish. The "Criteria API" is a bit better - but not as elegant as LINQ.

How can I force Linq to SQL NOT to use the cache?

When I make the same query twice, the second time it does not return new rows from the database (I guess it just uses the cache).
This is a Windows Form application, where I create the dataContext when the application starts.
How can I force Linq to SQL not to use the cache?
Here is a sample function where I have the problem:
public IEnumerable<Orders> NewOrders()
{
    return from order in dataContext.Orders
           where order.Status == 1
           select order;
}
The simplest way would be to use a new DataContext - given that most of what the context gives you is caching and identity management, it really sounds like you just want a new context. Why did you want to create just the one and then hold onto it?
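A minimal sketch of the per-operation context approach (assuming a generated MyDataContext class and a stored connection string; note the ToList() so the results are materialized before the context is disposed):
public IEnumerable<Orders> NewOrders()
{
    // A fresh context has an empty identity map, so nothing comes from a stale cache
    using (var dataContext = new MyDataContext(connectionString))
    {
        return (from order in dataContext.Orders
                where order.Status == 1
                select order).ToList(); // materialize before the context is disposed
    }
}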
By the way, for simple queries like yours it's more readable (IMO) to use "normal" C# with extension methods rather than query expressions:
public IEnumerable<Orders> NewOrders()
{
    return dataContext.Orders.Where(order => order.Status == 1);
}
EDIT: If you never want it to track changes, then set ObjectTrackingEnabled to false before you do anything. However, this will severely limit its usefulness. You can't just flip the switch back and forth (having made queries in between). Changing your design to avoid the singleton context would be much better, IMO.
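For completeness, a sketch of the read-only variant (again assuming a MyDataContext class; ObjectTrackingEnabled must be set before the context executes its first query):
using (var dataContext = new MyDataContext(connectionString))
{
    dataContext.ObjectTrackingEnabled = false; // set before the first query

    // Entities come back untracked, so the context is effectively read-only
    var orders = dataContext.Orders.Where(o => o.Status == 1).ToList();
}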
It can matter HOW you add an object to the DataContext as to whether or not it will be included in future queries.
Will NOT add the new InventoryTransaction to future in-memory queries
In this example I'm adding an object with an ID and then adding it to the context.
var transaction = new InventoryTransaction()
{
    AdjustmentDate = currentTime,
    QtyAdjustment = 5,
    InventoryProductId = inventoryProductId
};
dbContext.InventoryTransactions.InsertOnSubmit(transaction); // Table<T> inserts via InsertOnSubmit
dbContext.SubmitChanges();
Linq-to-SQL isn't clever enough to see this as needing to be added to the previously cached list of in-memory items in InventoryTransactions.
WILL add the new InventoryTransaction to future in-memory queries
var transaction = new InventoryTransaction()
{
    AdjustmentDate = currentTime,
    QtyAdjustment = 5
};
inventoryProduct.InventoryTransactions.Add(transaction); // EntitySet<T> does have Add
dbContext.SubmitChanges();
Wherever possible, use the collections in Linq-to-SQL when creating relationships, not the IDs.
In addition, as Jon says, try to minimize the scope of a DataContext as much as possible.