If we abstract out the DataContext, then are L2S and L2O queries identical?
I already have a working prototype which demonstrates this, but it is very simple and I wonder whether it will hold up to more advanced querying.
Does anyone know?
No, they're not the same.
LINQ to Objects queries operate on IEnumerable<T> collections. The query iterates through the collection and executes a sequence of methods (for example, Contains, Where, etc.) against its items.
LINQ to SQL queries operate on IQueryable<T> collections. The query is converted into an expression tree by the compiler and that expression tree is then translated into SQL and passed to the database.
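To make the distinction concrete, here is a minimal, self-contained sketch (the Customer type and the sample data are just placeholders): the same lambda becomes a compiled delegate for LINQ to Objects, but an expression tree for an IQueryable<T> provider.
using System;
using System.Linq;
using System.Linq.Expressions;

class Customer { public string LastName { get; set; } }

class Demo
{
    static void Main()
    {
        var names = new[] { new Customer { LastName = "Smith" } };

        // LINQ to Objects: the predicate is a compiled delegate (Func<T, bool>)
        // and Where simply invokes it on each element in memory.
        Func<Customer, bool> predicate = c => c.LastName.StartsWith("S");
        var inMemory = names.Where(predicate);

        // A queryable provider instead receives an Expression<Func<T, bool>>,
        // i.e. a data structure describing the lambda, which it can inspect and
        // translate (in LINQ to SQL's case, into a WHERE clause).
        Expression<Func<Customer, bool>> tree = c => c.LastName.StartsWith("S");
        var queryable = names.AsQueryable().Where(tree);

        Console.WriteLine(inMemory.Count() + " / " + queryable.Count());
    }
}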
It's quite commonplace for LINQ to SQL to complain that a method can't be translated into SQL, even though that method works perfectly in a LINQ to Objects query. (In other cases, you may not see an exception, but the query results might be subtly different between LINQ to Objects and LINQ to SQL.)
For example, LINQ to SQL will choke on this simple query, whereas LINQ to Objects will be fine:
var query = from n in names
            orderby n.LastName.TrimStart(',', ' ').ToUpper(),
                    n.FirstName.TrimStart(',', ' ').ToUpper()
            select new { n.FirstName, n.LastName };
(It's often possible to work around these limitations, but the fact that you can't guarantee that an arbitrary LINQ to Objects query will work as a LINQ to SQL query tells me that they're not the same!)
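For what it's worth, a common workaround is to let the provider translate the simple part of the query and then switch to LINQ to Objects with AsEnumerable() for the parts it can't handle - a sketch against the same names source as above:
var query = names
    .Select(n => new { n.FirstName, n.LastName })    // translated to SQL
    .AsEnumerable()                                   // everything below runs in memory
    .OrderBy(n => n.LastName.TrimStart(',', ' ').ToUpper())
    .ThenBy(n => n.FirstName.TrimStart(',', ' ').ToUpper());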
Frustratingly, all IQueryable<T> implementations are, essentially, leaky abstractions - it is not safe to assume that something that works in LINQ-to-Objects will still work under any other provider. Apart from the obvious function mappings, there are issues like:
LINQ-to-SQL can't possibly support all functions / overloads - the ones it does support are listed in Data Types and Functions (LINQ to SQL)
plus it depends on the actual database server; Skip/Take etc work differently on SQL Server 2000 than on 2005+, and not every such translation works on SQL Server 2000
EF doesn't support Single or Expression.Invoke (sub-expression invocation), or UDF usage
Astoria supports different use of Single/First; as I recall it supports Where(pred).Single() - but not Single(pred) (which is the preferred usage for LINQ-to-SQL)
So you can't really use IEnumerable<T> for your unit tests simulating a database, even via AsQueryable() - it simply isn't robust. Personally, I keep IQueryable<T> and Expression away from the repository interface for this reason - see Pragmatic LINQ.
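To illustrate the kind of repository shape I mean, here is a sketch (the Customer entity and the method names are placeholders): the interface exposes materialized results and explicit query methods, so callers can't build arbitrary queries that the underlying provider may not support.
using System.Collections.Generic;

public interface ICustomerRepository
{
    // Fully materialized results; no IQueryable<T> or Expression leaks out,
    // so every supported query shape is an explicit, individually testable method.
    IList<Customer> GetCustomersByCity(string city);
    Customer GetCustomer(int id);
}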
The query syntax is the same. If you use Queryable.AsQueryable(), even the types are the same. But there are some differences:
some queries will only work in L2O and will result in a runtime error in L2S (e.g. if an expression tree contains a function that cannot be converted to SQL; this cannot be detected at compile time)
some queries return different results in L2S and L2O (example: Max([empty sequence]) will throw an exception in L2O but return null in L2S)
So in the end, you will have to test against a database to be sure, but I think L2O is pretty good for simple, fast unit-tests.
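As a concrete illustration of the second point, a small sketch (db, Orders, CustomerId and Amount are placeholders): the usual trick for getting the same behaviour from both providers is to cast to a nullable type before calling Max.
// LINQ to Objects: Max() over an empty sequence of a non-nullable type throws.
// var localMax = new int[0].Max();   // InvalidOperationException

// LINQ to SQL: SQL's MAX(...) over zero rows yields NULL. Casting to a nullable
// type makes both providers return null for an empty sequence:
int? maxAmount = db.Orders
    .Where(o => o.CustomerId == 42)
    .Max(o => (int?)o.Amount);         // null when there are no matching rows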
I want to request a large number of records (100,000 to 1,000,000) per select request, with a join of three tables. Is the performance much better with native SQL instead of using spring-data-jpa to map the results to @Entity objects?
Thx!
JPA, like every ORM, turns your query results into domain objects.
That of course takes resources. Spring Data JPA adds potential conversions on top of that, and it preprocesses your query in order to support fancy ways of setting parameters.
If you are selecting large amounts of data the preprocessing of the statement probably doesn't matter that much.
But the conversion to domain objects will.
You used the word "migrating", which sounds like you are going to select data and then immediately write it somewhere else. If that is the case, use plain SQL, work directly on the ResultSet, and tell the driver to make it read-only and forward-only. See Understanding Forward Only ResultSet.
Here is a simple MySQL query I want to use in a Symfony2 project:
SELECT * FROM
(
SELECT n.sdate, n.edate FROM `news` n
UNION
SELECT ss.sdate, ss.edate FROM `stagesession` ss
) AS sub
ORDER BY sub.sdate
In fact, this query will be a little more complicated, with more aliases, filters and joins with other tables.
Do I have to convert it into a DQL query with the createQueryBuilder, or is the best way simply to use createNativeQuery from Doctrine?
My personal Best Practice with Doctrine is:
Query (QB vs. DQL vs. SQL):
use QB if building your query is more conditional than just passing some parameters, like if($onlyActive) $qb->andWhere('x.type = 5'); (I don't like string concat stuff)
use QB for compatibility with pagination toolkits
use DQL for simple selects
use SQL if DQL-query not possible (e.g. DB-native expressions MySQL/Oracle/MSSQL, some weird statistics or hacky queries with UNION or huge subqueries)
lastly, SQL is also an option if you only use a small subset of a very large database (for example when writing some plugin software); otherwise, if the database schema is fairly small, you could create entities from it and revalidate them (for example when you deploy) as a system test. But if the schema is too complicated, QB or DQL would be overkill for accessing such a database, because you would have to define entities just to work with it.
Result (orm vs. flat):
use ORM in business code wherever possible to have max. readable code (consider lazy loading)
use ORM in complicated nested views (no huge tables) to have nice clean code in your template (consider eager loading)
use flat arrays for read-only tables/lists
use flat arrays for optimization reasons when dealing with lots of data (and caching is not possible)
And always keep in mind that you should first write simple code, and only if it's too slow, optimize it with eager/lazy loading, query/result caching, HTTP caching; and lastly, if you e.g. deal with some database synchronization or a data importer, you may have to use flat arrays or fall back to native implementations - but don't underrate the ORM ;).
SQL parameterization is a hot topic nowadays, and for a good reason, but does it really do anything besides escaping decently?
I could imagine a parameterization engine simply making sure the data is decently escaped before inserting it into the query string, but is that really all it does? It would make more sense to do something differently in the connection, e.g. like this:
> Sent data. Formatting: length + space + payload
< Received data
-----
> 69 SELECT * FROM `users` WHERE `username` LIKE ? AND `creation_date` > ?
< Ok. Send parameter 1.
> 4 joe%
< Ok. Send parameter 2.
> 1 0
< Ok. Query result: [...]
This way would simply eliminate the issue of SQL injections, so you wouldn't have to avoid them through escaping. The only other way I can think of that parameterization might work is by escaping the parameters:
// $params would usually be an argument, not hard-coded like this
$params = ['joe%', 0];
// Escape the values
foreach ($params as $key => $value)
    $params[$key] = mysql_real_escape_string($value);
// For each question mark in $query_string (another argument of the function),
// replace it with the escaped value.
$n = 0;
while (($pos = strpos($query_string, "?")) !== false && $n < count($params)) {
    // If it's numeric, don't put quotes around it.
    $param = is_numeric($params[$n]) ? $params[$n] : "'" . $params[$n] . "'";
    // Update the query string with the replaced question mark
    $query_string = substr($query_string, 0, $pos)
        . $param
        . substr($query_string, $pos + 1);
    $n++;
}
If the latter is the case, I'm not going to switch my sites to parameterization just yet. It has no advantage that I can see; it's just another strong vs. weak typing discussion. Strong typing may catch more errors at compile time, but it doesn't really make anything possible that would be hard to do otherwise - same with this parameterization. (Please correct me if I'm wrong!)
Update:
I knew this would depend on the SQL server (and also on the client, but I assume the client uses the best possible techniques), but mostly I had MySQL in mind. Answers concerning other databases are (and were) also welcome though.
As far as I understand the answers, parameterization does indeed do more than simply escape the data. The query really is sent to the server in a parameterized form, with the variables separated rather than folded into a single query string.
This also enables the server to store and reuse the query with different parameters, which gives better performance.
Did I get everything? One thing I'm still curious about is whether MySQL has these features, and whether query reuse happens automatically (and if not, how it can be done).
Also, please comment when anyone reads this update. I'm not sure if it bumps the question or something...
Thanks!
I'm sure that the way that your command and parameters are handled will vary depending on the particular database engine and client library.
However, speaking from experience with SQL Server, I can tell you that parameters are preserved when sending commands using ADO.NET. They are not folded into the statement. For example, if you use SQL Profiler, you'll see a remote procedure call like:
exec sp_executesql N'INSERT INTO Test (Col1) VALUES (@p0)',N'@p0 nvarchar(4000)',@p0=N'p1'
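For reference, a minimal sketch of the kind of client code that produces a trace like that (the Test table and connectionString are placeholders):
using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO Test (Col1) VALUES (@p0)", conn))
{
    // The value travels as a typed parameter and is never spliced into the SQL
    // text, so no escaping is needed and the statement text stays constant.
    cmd.Parameters.Add("@p0", SqlDbType.NVarChar, 4000).Value = "p1";
    conn.Open();
    cmd.ExecuteNonQuery();
}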
Keep in mind that there are other benefits to parameterization besides preventing SQL injection. For example, the query engine has a better chance of reusing query plans for parameterized queries because the statement is always the same (just the parameter values change).
In response to update:
Query parameterization is so common I would expect MySQL (and really any database engine) to handle it similarly.
Based on the MySQL protocol documentation, it looks like prepared statements are handled using COM_PREPARE and COM_EXECUTE packets, which do support separate parameters in binary format. It's not clear if all parameterized statements will be prepared, but it does look like unprepared statements are handled by COM_QUERY which has no mention of parameter support.
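If you want to force the prepared-statement path from client code and then watch it on the wire, here is a sketch using MySQL's Connector/NET (an assumption about the client library; the users table and connectionString are placeholders):
using MySql.Data.MySqlClient;

using (var conn = new MySqlConnection(connectionString))
using (var cmd = new MySqlCommand(
    "SELECT * FROM users WHERE username LIKE @name AND creation_date > @since", conn))
{
    conn.Open();
    cmd.Parameters.AddWithValue("@name", "joe%");
    cmd.Parameters.AddWithValue("@since", 0);
    // Without Prepare() the connector may fall back to client-side substitution;
    // some versions also require "Ignore Prepare=false" in the connection string.
    cmd.Prepare();
    using (var reader = cmd.ExecuteReader())
    {
        // ... read rows ...
    }
}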
When in doubt: test. If you really want to know what's sent over the wire, use a network protocol analyzer like Wireshark and look at the packets.
Regardless of how it's handled internally and any optimizations it may or may not currently provide for a given engine, there's very little (nothing?) to gain from not using parameters.
Parameterized queries are passed to the SQL implementation as parameterized queries; the parameters are never concatenated into the query itself unless an implementation decides to fall back to concatenating them itself. Parameterized queries avoid the need for escaping and improve performance, since the query text is generic and it is more likely that a compiled form of the query is already cached by the database server.
The straight answer is "it's implemented whatever way it's implemented in the particular implementation in question". There's dozens of databases, dozens of access layers and in some cases more than one way for the same access layer to deal with the same code.
So, there isn't a single correct answer here.
One example would be that if you use Npgsql with a query that isn't a prepared statement, then it pretty much just escapes things correctly (though escaping in Postgresql has some edge cases that people who know about escaping miss, and Npgsql catches them all, so still a gain). With a prepared statement, it sends parameters as prepared-statement parameters. So one case allows for greater query-plan reuse than the other.
The SQLServer driver for the same framework (ADO.NET) passes queries through as calls to sp_executesql, which allows for query-plan re-use.
As well as that, the matter of escaping is still worth considering for a few reasons:
It's the same code each time. If you're escaping yourself, then either you're doing so through the same piece of code each time (so it's not like there's any downside to using someone else's same piece of code), or you're risking a slip-up each time.
They're also better at not escaping. There's no point going through every character in the string representation of a number looking for ' characters, for example. But does not escaping count as a needless risk, or a reasonable micro-optimisation?
Well, "reasonable micro-optimisation" in itself means one of two things. Either it requires no mental effort to write or to read for correctness afterwards (in which case you might as well), or it's hit frequently enough that tiny savings will add up, and it's easily done.
(Relatedly, it also makes more sense to write a highly optimised escaper - the sort of string replacement involved is the sort of case where the most common approach of replacing isn't as fast as some other approaches in some languages at least, but the optimisation only makes sense if the method will be called a very large number of times).
If you've a library that includes type checking the parameter (either in basing the format used on the type, or by validation, both of which are common with such code), then it's easy to do and since these libraries aim at mass use, it's a reasonable micro-opt.
If you're thinking each time about whether parameter number 7 of an 8-parameter call could possibly contain a ' character, then it's not.
They're also easier to translate to other systems if you want. To again look at the two examples I gave above, apart from the classes created, you can use pretty much identical code with System.Data.SqlClient as with Npgsql, though SQL-Server and Postgresql have different escaping rules. They also have an entirely different format for binary strings, date-times and a few other datatypes they have in common.
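A sketch of what that portability looks like in practice (the users table and an already-open connection are assumptions): only the concrete connection type differs between providers, and each provider handles its own quoting rules and binary formats.
using System;
using System.Data.Common;

static int CountUsers(DbConnection conn, string pattern)
{
    // Works unchanged whether conn is a SqlConnection or an NpgsqlConnection;
    // conn is assumed to be open already.
    using (DbCommand cmd = conn.CreateCommand())
    {
        cmd.CommandText = "SELECT COUNT(*) FROM users WHERE username LIKE @name";
        DbParameter p = cmd.CreateParameter();
        p.ParameterName = "@name";
        p.Value = pattern;
        cmd.Parameters.Add(p);
        return Convert.ToInt32(cmd.ExecuteScalar());
    }
}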
Also, I can't really agree with calling this a "hot topic". It's had a well-established consensus for well over a decade at the very least.
Let's say I use Linq to Sql to interact with the database from C# - what challenges might I face in terms of architecture, performance, type safety, object orientation, etc.?
Basically, Linq to SQL generates a class for each table in your database, complete with relation properties and all, so you will have no problems with type safety. The use of C# partial classes allows you to add functionality to these objects without touching Linq to SQL's autogenerated code. It works pretty well.
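For example, a sketch of extending a generated entity through a partial class (the Customer entity and its properties stand in for whatever the designer generates):
// Generated by the Linq to SQL designer (simplified):
// public partial class Customer { public string FirstName { get; set; } public string LastName { get; set; } ... }

// Your own file - same class name and namespace, untouched when the designer regenerates:
public partial class Customer
{
    public string FullName
    {
        get { return FirstName + " " + LastName; }
    }
}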
As tables map directly to classes and objects, you will either have to accept that your domain layer mirrors the database design directly, or you will have to build some form of abstraction layer on top of Linq to SQL. The direct mirroring of tables can be especially troublesome with many-to-many relations, which are not directly supported - instead of Order.Products you get Order.OrderDetails.Select(od => od.Product).
Unlike most other ORMs, Linq to SQL does not just dispense objects from the database and allow you to store or update objects by passing them back to the ORM. Instead, Linq to SQL tracks the state of objects loaded from the database and allows you to change that saved state. It is difficult to explain and strange to understand at first - I recommend you read some of Rick Strahl's blog posts on the subject.
Performance wise Linq-to-SQL does pretty good. In benchmarking tests it shows speeds of about 90-95% of what a native SQL reader would provide, and in my experience real world usage is also pretty fast. Like all ORMs Linq to SQL is affected by the N+1 selects problem, but it provides good ways to specify lazy/eager loading depending on context.
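For instance, eager loading in Linq to SQL is set up through DataLoadOptions (a sketch; MyDataContext and the Customer/Orders model are placeholders), which avoids the N+1 pattern by fetching the related rows along with their parents:
using System;
using System.Data.Linq;

using (var dc = new MyDataContext())
{
    var options = new DataLoadOptions();
    options.LoadWith<Customer>(c => c.Orders);   // fetch Orders together with Customers
    dc.LoadOptions = options;                    // must be set before the first query runs

    foreach (var customer in dc.Customers)
    {
        // customer.Orders is already populated - no extra query per customer
        Console.WriteLine(customer.Name + ": " + customer.Orders.Count + " orders");
    }
}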
Also, by choosing Linq to SQL you choose MSSQL - there do exist third party solutions that allow you to connect to other databases, but last time I checked, none of them appeared very complete.
All in all, Linq to SQL is a good and somewhat easy to learn ORM, which performs okay. If you need features beyond what Linq to SQL is offering, take a look at the new entity framework - it has more features, but is also more complex.
We've had a few challenges, mainly from opening the query construction capability to programmers that don't understand how databases work. Here are a few smells:
//bad scaling
//Query in a loop - causes n roundtrips (one per order)
// when c roundtrips (one per customer) could have been performed.
List<OrderDetail> od = new List<OrderDetail>();
foreach (Customer cust in customers)
{
    foreach (Order o in cust.Orders)
    {
        od.AddRange(dc.OrderDetails.Where(x => x.OrderId == o.OrderId));
    }
}
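A sketch of the same work done in a single round trip (same hypothetical dc context; Contains is translated into an IN clause by LINQ to SQL):
// Collect the keys locally, then issue one query for all of them.
var orderIds = customers
    .SelectMany(c => c.Orders)
    .Select(o => o.OrderId)
    .ToList();

List<OrderDetail> od = dc.OrderDetails
    .Where(x => orderIds.Contains(x.OrderId))   // WHERE OrderId IN (...)
    .ToList();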
//no separation of
// operations intended for execution in the database
// from operations intended to be executed locally
var query =
    from c in dc.Customers
    where c.City.StartsWith(textBox1.Text)
    where DateTime.Parse(textBox2.Text) <= c.SignUpDate
    from o in c.Orders
    where o.OrderCode == (OrderCodes)Enum.Parse(typeof(OrderCodes), "Complete")
    select o;
//not understanding when results are pulled into memory
// causing a full table load
List<Item> result = dc.Items.ToList().Skip(100).Take(20).ToList();
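The fix for that last one is simply to keep the query as IQueryable<T> until after the paging operators, so Skip/Take are translated into SQL rather than applied to a fully loaded table:
// Skip/Take are part of the translated query; only 20 rows cross the wire.
List<Item> result = dc.Items.Skip(100).Take(20).ToList();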
Another problem is that being one more level removed from the table structures means indexes are even easier to ignore (that's a problem with any ORM, though).
In a controversial blog post today, Hackification pontificates on what appears to be a bug in the new LINQ To Entities framework:
Suppose I search for a customer:
var alice = data.Customers.First( c => c.Name == "Alice" );
Fine, that works nicely. Now let's see if I can find one of her orders:
var order = ( from o in alice.Orders
where o.Item == "Item_Name"
select o ).FirstOrDefault();
LINQ-to-SQL will find the child row. LINQ-to-Entities will silently return nothing.
Now let's suppose I iterate through all orders in the database:
foreach (var order in data.Orders) {
    Console.WriteLine("Order: " + order.Item);
}
And now repeat my search:
var order = ( from o in alice.Orders
where o.Item == "Item_Name"
select o ).FirstOrDefault();
Wow! LINQ-to-Entities is suddenly telling me the child object exists, despite telling me earlier that it didn't!
My initial reaction was that this had to be a bug, but after further consideration (and backed up by the ADO.NET Team), I realized that this behavior was caused by the Entity Framework not lazy loading the Orders subquery when Alice is pulled from the datacontext.
This is because order is a LINQ-To-Object query:
var order = ( from o in alice.Orders
where o.Item == "Item_Name"
select o ).FirstOrDefault();
And is not accessing the datacontext in any way, while his foreach loop:
foreach( var order in data.Orders )
Is accessing the datacontext.
LINQ-To-SQL actually creates lazy-loaded properties for Orders, so that when they are accessed another query is performed; LINQ to Entities leaves it up to you to manually retrieve related data.
Now, I'm not a big fan of ORMs, and this is precisely the reason. I've found that in order to have all the data you want ready at your fingertips, they repeatedly execute queries behind your back - for example, that linq-to-sql query above might run an additional query per row of Customers to get Orders.
However, the EF not doing this seems to be a major violation of the principle of least surprise. While it is a technically correct way to do things (you should run a second query to retrieve the orders, or retrieve everything from a view), it does not behave the way you would expect from an ORM.
So, is this good framework design? Or is Microsoft over thinking this for us?
Jon,
I've been playing with linq to entities also. It's got a long way to go before it catches up with linq to SQL. I've had to use linq to entities for the Table per Type inheritance stuff. I found a good article recently which explains the whole "one company, two different ORM technologies" thing here.
However, you can do lazy loading, in a way, by doing this:
// Lazy Load Orders
var alice2 = data.Customers.First(c => c.Name == "Alice");
// Should Load the Orders
if (!alice2.Orders.IsLoaded)
    alice2.Orders.Load();
or you could just include the Orders in the original query:
// Include Orders in original query
var alice = data.Customers.Include("Orders").First(c => c.Name == "Alice");
// Should already be loaded
if (!alice.Orders.IsLoaded)
    alice.Orders.Load();
Hope it helps.
Dave
So, is this good framework design? Or is Microsoft over thinking this for us?
Well, let's analyse that - all the thinking that Microsoft does so we don't have to really makes us lazier programmers. But in general it does make us more productive (for the most part). So are they overthinking, or are they just thinking for us?
If LINQ-to-Sql and LINQ-to-Entities came from two different companies, it would be an acceptable difference - there's no law stating that all LINQ-To-Whatevers have to be implemented the same way.
However, they both come from Microsoft - and we shouldn't need intimate knowledge of their internal development teams and processes to know how to use two different things that, on their face, look exactly the same.
ORMs have their place, and do indeed fill a gap for people trying to get things done, but ORM users must know exactly how their ORM gets things done - treating it like an impenetrable black box will only lead you to trouble.
Having lost a few days to this very problem, I sympathize.
The "fault," if there is one, is that there's a reasonable tendency to expect that a layer of abstraction is going to insulate from these kinds of problems. Going from LINQ, to Entities, to the database layer, doubly so.
Having to switch from MS-SQL (using LinqToSql) to MySQL (using LinqToEntities), for instance, one would figure that the LINQ, at least, would be the same, if only to save the cost of having to rewrite program logic.
Having to litter code with .Load() and/or LINQ with .Include() simply because the persistence mechanism under the hood changed seems slightly disturbing, especially with a silent failure. The LINQ layer ought to at least behave consistently.
A number of ORM frameworks use a proxy object to dynamically load the lazy object transparently, rather than just return null, though I would have been happy with a collection-not-loaded exception.
I tend not to buy into the they-did-it-deliberately-for-your-benefit excuse; other ORM frameworks let you annotate whether you want eager or lazy-loading as needed. The same could be done here.
I don't know much about ORMs, but as a user of LinqToSql and LinqToEntities I would hope that when you try to query Orders for Alice, it does the extra query for you when you execute the linq query (as opposed to not querying anything, or querying everything for every row).
It seems natural to expect
from o in alice.Orders where o.Item == "Item_Name" select o
to work, given that's one of the reasons people use ORMs in the first place (to simplify data access).
The more I read about LinqToEntities, the more I think LinqToSql fulfills most developers' needs adequately. I usually just need a one-to-one mapping of tables.
Even though you shouldn't have to know about Microsoft's internal development teams and processes, fact of the matter is that these two technologies are two completely different beasts.
The design decision for LINQ to SQL was, for simplicity's sake, to implicitly lazy-load collections. The ADO.NET Entity Framework team didn't want to execute queries without the user knowing so they designed the API to be explicitly-loaded for the first release.
LINQ to SQL has been handed over to the ADO.NET team, so you may see a consolidation of APIs in the future, or LINQ to SQL may get folded into the Entity Framework, or you may see LINQ to SQL atrophy from neglect and eventually become deprecated.