I have a Grails application that does a rather huge createCriteria query pulling from many tables. I noticed that the performance is pretty terrible and have pinpointed it to the object manipulation I do afterwards, rather than the createCriteria itself. My query successfully gets all of the original objects I wanted, but it then performs a new query for each element as I manipulate the objects. Here is a simplified version of my controller code:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
// Lots of if statements for filters, etc.
}
def results = hosts?.collect{ [ cell: [
    it.hostname,
    it.type,
    it.status.toString(),
    it.env.toString(),
    it.supporter.person.toString()
    ...
]]}
I have many more fields, including calls to methods that perform their own queries to find related objects. So my question is: How can I incorporate joins into the original query so that I am not performing tons of extra queries for each individual row? Currently querying for ~700 rows takes 2 minutes, which is way too long. Any advice would be great! Thanks!
One benefit you get from using criteria is that you can easily fetch associations eagerly, which means you avoid the well-known N+1 problem when accessing those associations.
You have not shown the logic inside your criteria, but for ~700 rows I would definitely go for something like this:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    ...
    // associations are eagerly fetched when a nested
    // DSL block like this is used in a criteria query
    supporter {
        person {
        }
    }
    someOtherAssoc {
        // involve logic if required
        // eq('someOtherProperty', someOtherValue)
    }
}
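If you would rather not nest empty blocks just to force the join, the criteria DSL also lets you set Hibernate's fetch mode explicitly. A hedged sketch (it assumes supporter and supporter.person are valid association paths on Host):

import org.hibernate.FetchMode

def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    // eagerly join the associations without adding any restrictions on them
    fetchMode 'supporter', FetchMode.JOIN
    fetchMode 'supporter.person', FetchMode.JOIN
}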
If you feel that tailoring a Criteria query is cumbersome, then you can very well fall back to HQL and use join fetch for eager fetching of associations.
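For example, a minimal HQL sketch (an assumption on my part that supporter is a single-ended association on Host; GORM accepts the max/offset pagination params in the parameter map):

def hosts = Host.executeQuery(
    'select h from Host h ' +
    'left join fetch h.supporter s ' +
    'left join fetch s.person',
    [max: maxRows, offset: rowOffset])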
I would expect this to cut the turnaround time for ~700 records down to a few seconds.
I have built an application with Laravel where I end up having rather deeply nested relationships that I sometimes need to query. The database is MySQL.
For instance, I want to retrieve all Users who are allowed to read a Book. My data is structured as follows:
A User belongs to 0-n UserGroups through a UserMembership
A UserGroup has 0-n Rights
A Right relates to 1 Book and describes what action can be performed
After looking around, I found that some people recommend the following way to address nested relationships:
// class Book extends Model
public function readers() {
    $bookId = $this->id;
    return User::whereHas('memberships', function($m) use($bookId) {
        $m->whereHas('group', function($g) use($bookId) {
            $g->whereHas('rights', function($r) use($bookId) {
                $r->where('resource_id', $bookId)->where('action', 'read');
            });
        });
    });
}
I like that the code makes a lot of sense, but the performance is terrible: execution time is 430ms on average for Book::find(967)->readers()->get()
I re-wrote the function as follows:
public function readersNew() {
    $bookId = $this->id;
    $g = Right::where('resource_id', $bookId)->where('action', 'read')->pluck('group_id');
    $uIds = UserMembership::whereIn('group_id', $g)->pluck('user_id');
    return User::whereIn('id', $uIds);
}
With this code I achieve an average execution time of 4ms, which is obviously much better. But it also looks much less "methodical" as a way of writing nested requests.
I would really like to understand:
why readers()->get() is so much slower than readersNew()->get()
what the best way is to write such requests
First of all, good job on improving the performance without even knowing why it was happening :)
Q1: why readers()->get() is so much slower than readersNew()->get()
Your readers()->get() function traverses the hierarchy top-down, which is why it reads more naturally but runs slower. It behaves like three nested foreach loops: it first takes all users that have a membership, then iterates over each user to find all the groups it belongs to, then iterates over each group to find its rights, and finally checks each right for a matching resource_id and action. (In SQL terms, each whereHas compiles to a correlated WHERE EXISTS subquery, so the nesting happens inside the database.)
Your readersNew()->get(), on the other hand, traverses the hierarchy bottom-up, which is why it is faster. It first extracts the target groups from the matching rights, then extracts the memberships and users associated with those groups, so each step is a flat query over an IN list.
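To make that concrete, here is roughly what the two approaches compile to (a hedged sketch; the actual table and column names depend on your schema):

-- readers(): one query with nested, correlated EXISTS subqueries,
-- which the database evaluates per candidate user row
select * from users where exists (
    select * from user_memberships
    where user_memberships.user_id = users.id and exists (
        select * from groups
        where groups.id = user_memberships.group_id and exists (
            select * from rights
            where rights.group_id = groups.id
              and rights.resource_id = ? and rights.action = 'read')));

-- readersNew(): three flat queries chained through IN lists
select group_id from rights where resource_id = ? and action = 'read';
select user_id from user_memberships where group_id in (...);
select * from users where id in (...);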
Q2: what the best way is to write such requests
The readersNew()->get() approach is the best; you could just rename things so it reads more clearly, if you like:
public function readersNew() {
    $bookId = $this->id;
    $targetGroup = Right::where(['resource_id' => $bookId, 'action' => 'read'])->pluck('group_id');
    $associatedUserIds = UserMembership::whereIn('group_id', $targetGroup)->pluck('user_id');
    return User::whereIn('id', $associatedUserIds);
}
I hope it helps.
With regards to Laravel Eloquent queries and eager loading, which of these queries is most efficient? Does it make a difference?
$data = Model::with('relationship')
    ->with('relationship.content')
    ->with('meta')
    ->with('meta.meta_type')
    ->first();
as opposed to :
$data = Model::with('relationship', 'relationship.content', 'meta', 'meta.meta_type')
    ->first();
There is a difference. Is it the difference you're thinking of?
Is there a difference in the query/queries performed, or the data returned?
Answer: No.
All of the calls to with() will be combined into a single set of eager loads, which will be parsed and queried when the query is completed upon calling first(). Both code examples will be turned into the same set of eager loads, and the resulting models should be identical.
Is there a performance difference?
Answer: Yes, a small one.
The performance difference is very small, and would be considered a micro-optimization by many (meaning that it is only worth optimizing if you are working at a significant scale).
Each call to with() will determine the type of value you've passed (one or more strings, or an array), validate and parse the relationships (including finding any nested ones), and then merge the results with any existing relationships from previous with() calls.
If you're interested in writing the most optimal code, the first and largest step you can take is to only call with() once:
$data = Model::with('relationship', 'relationship.content', 'meta', 'meta.meta_type')
    ->first();
If the value supplied to the call contains one or more strings, PHP's func_get_args() is called. If you pass an array, the array is used directly. That's the next optimization we can make: use an array.
$data = Model::with(['relationship', 'relationship.content', 'meta', 'meta.meta_type'])
    ->first();
Finally, when you pass a nested relationship to Laravel, both relationships will be included in the eager load. Including both parent and parent.child relationships is redundant.
Your call - still functionally identical - can be reduced to:
$data = Model::with(['relationship.content', 'meta.meta_type'])
    ->first();
No, it does not make a difference.
The with() method accepts either a single relation (as a string) or an array of relations.
Both of the calls you have written do the same thing.
Here's how it works behind the scenes:
/**
 * Begin querying a model with eager loading.
 *
 * @param  array|string  $relations
 * @return \Illuminate\Database\Eloquent\Builder|static
 */
public static function with($relations)
{
    return (new static)->newQuery()->with(
        is_string($relations) ? func_get_args() : $relations
    );
}
I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit). The slowness most likely comes from saving many objects one at a time; that per-row operation is known to be slow. Note that the loop as written even issues two queries per row: StrategyRow.objects.create() already saves the row, and row.save() then updates it.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a number of rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet)
Back to your idea of raw SQL: it would make a difference if, for example, the Python code that creates the StrategyRow instances were slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
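Concretely, a sketch of the bulk_create version of your loop (hedged: it assumes price and weight are nullable fields on StrategyRow, so the NaN guards become None):

# build unsaved model instances in memory...
rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        rows.append(StrategyRow(
            strategy=strategy, symbol=s, time=t,
            price=None if math.isnan(p) else p,
            weight=None if math.isnan(w) else w))

# ...then insert them in batches, issuing a handful of queries in total
StrategyRow.objects.bulk_create(rows, batch_size=100)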
I have been playing about with LINQ to SQL, trying to get re-usable chunks of expressions that I can hot-plug into other queries. So, I started with something like this:
Func<TaskFile, double> TimeSpent = (t =>
    t.TimeEntries.Sum(te => (te.DateEnded - te.DateStarted).TotalHours));
Then, we can use the above in a LINQ query like the below (LINQPad example):
TaskFiles.Select(t => new {
    t.TaskId,
    TimeSpent = TimeSpent(t),
})
This produces the expected output, except that a query per row is generated for the plugged expression. This is visible within LINQPad. Not good.
Anyway, I noticed the CompiledQuery.Compile method. Although this takes a DataContext as a parameter, I thought I would just ignore it and try the same Func. So I ended up with the following:
static Func<UserQuery, TaskFile, double> TimeSpent =
    CompiledQuery.Compile<UserQuery, TaskFile, double>(
        (UserQuery db, TaskFile t) =>
            t.TimeEntries.Sum(te => (te.DateEnded - te.DateStarted).TotalHours));
Notice here that I am not using the db parameter. However, when we use this updated Func, only one SQL query is generated. The expression is successfully translated to SQL and included within the original query.
So my ultimate question is: what makes CompiledQuery.Compile so special? It seems that the DataContext parameter isn't needed at all, and at this point I am thinking it is more a convenience parameter to generate full queries.
Would it be considered a good idea to use the CompiledQuery.Compile method like this? It seems like a big hack, but it seems like the only viable route for LINQ re-use.
UPDATE
Using the first Func within a Where statement, we see the following exception:
NotSupportedException: Method 'System.Object DynamicInvoke(System.Object[])' has no supported translation to SQL.
Like the following:
.Where(t => TimeSpent(t) > 2)
However, when we use the Func generated by CompiledQuery.Compile, the query is successfully executed and the correct SQL is generated.
I know this is not the ideal way to re-use Where statements, but it shows a little how the Expression Tree is generated.
Exec Summary:
Expression.Compile generates a CLR method, whereas CompiledQuery.Compile generates a delegate that is a placeholder for SQL.
One of the reasons you did not get a correct answer until now is that some things in your sample code are incorrect. And without the database, or a generic sample someone else can play with, the chances are further reduced (I know it's difficult to provide that, but it's usually worth it).
On to the facts:
Expression<Func<TaskFile, double>> TimeSpent = (t =>
    t.TimeEntries.Sum(te => (te.DateEnded - te.DateStarted).TotalHours));
Then, we can use the above in a LINQ query like the below:
TaskFiles.Select(t => new {
    t.TaskId,
    TimeSpent = TimeSpent(t),
})
(Note: maybe you used a Func<> type for TimeSpent. That yields the same situation as the scenario outlined in the paragraph below, so make sure to read and understand it.)
No, this won't compile. Expressions can't be invoked (TimeSpent is an expression); they need to be compiled into a delegate first. What happens under the hood when you invoke Expression.Compile() is that the expression tree is compiled down to IL, which is injected into a DynamicMethod, for which you then get a delegate.
The following would work:
var q = TaskFiles.Select(t => new {
    t.TaskId,
    TimeSpent = TimeSpent.Compile().DynamicInvoke(t)
});
This produces the expected output, except that a query per row is generated for the plugged expression. This is visible within LINQPad. Not good.
Why does that happen? Well, Linq To Sql will need to fetch all TaskFiles, hydrate the TaskFile instances, and then run your selector against them in memory. You likely get a query per TaskFile because they contain one or more 1:m mappings.
While LTS allows projecting in memory for Selects, it does not do so for Wheres (citation needed; this is to the best of my knowledge). When you think about it, this makes perfect sense: you would likely transfer a lot more data by filtering the whole database in memory than by transforming a subset of it in memory. (Though it creates query performance issues, as you see; something to be aware of when using an ORM.)
CompiledQuery.Compile() does something different. It compiles the query to SQL and the delegate it returns is only a placeholder Linq to SQL will use internally. You can't "invoke" this method in the CLR, it can only be used as a node in another expression tree.
So why does LTS generate an efficient query with the CompiledQuery.Compile'd expression then? Because it knows what this expression node does, because it knows the SQL behind it. In the Expression.Compile case, it's just an InvokeExpression that invokes the DynamicMethod, as I explained previously.
Why does it require a DataContext Parameter? Yes, it's more convenient for creating full queries, but it's also because the Expression Tree compiler needs to know the Mapping to use for generating the SQL. Without this parameter, it would be a pain to find this mapping, so it's a very sensible requirement.
I'm surprised why you've got no answers on this so far. CompiledQuery.Compile compiles and caches the query. That is why you see only one query being generated.
Not only is this NOT a hack, it is the recommended way!
Check out these MSDN articles for detailed info and example:
Compiled Queries (LINQ to Entities)
How to: Store and Reuse Queries (LINQ to SQL)
Update: (exceeded the limit for comments)
I did some digging in Reflector and I do see the DataContext being used. In your example, you're simply not using it.
Having said that, the main difference between the two is that the former creates a delegate (for the expression tree) and the latter creates the SQL, which gets cached, and actually returns a function (sort of). The first two expressions produce the query when you call Invoke on them; this is why you see multiple queries.
If your query doesn't change, only the DataContext and the parameters do, and you plan to use it repeatedly, CompiledQuery.Compile will help. Compiling is expensive, so for one-off queries there is no benefit.
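For illustration, the typical usage pattern looks like this (a sketch only: MyDataContext stands in for your DataContext subclass, and the query body is invented):

// compile once, e.g. into a static field, paying the compilation cost a single time
static readonly Func<MyDataContext, int, IQueryable<TaskFile>> TasksById =
    CompiledQuery.Compile((MyDataContext db, int id) =>
        db.TaskFiles.Where(t => t.TaskId == id));

// then reuse the delegate with a live context and fresh parameters on each call
using (var db = new MyDataContext())
{
    var tasks = TasksById(db, 42).ToList();
}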
TaskFiles.Select(t => new {
    t.TaskId,
    TimeSpent = TimeSpent(t),
})
This isn't a LinqToSql query, as there is no DataContext instance. Most likely you are querying some EntitySet, which does not implement IQueryable.
Please post complete statements, not statement fragments. (I see an invalid comma, no semicolon, and no assignment.)
Also, try this:
var query = myDataContext.TaskFiles
    .Where(tf => tf.Parent.Key == myParent.Key)
    .Select(t => new {
        t.TaskId,
        TimeSpent = TimeSpent(t)
    });
// where myParent is the source of the EntitySet and Parent is a relational property,
// and Key is the primary key property of Parent.
I'm used to EF because it usually works just fine once you get to know it well enough to optimise your queries. But.
What would you choose when you know you'll be working with large quantities of data? I know I wouldn't want to use EF in the first place and cripple my application. I would write highly optimised stored procedures and call those to get certain very narrow results (with many joins so they probably won't just return certain entities anyway).
So I'm a bit confused about which DAL technology/library I should use. I don't want to use the SqlConnection/SqlCommand way of doing it, since I would have to write much more code that's likely to hide some obscure bugs.
I would like to make the bug surface as small as possible and use a technology that will accommodate my process, not vice versa...
Is there any library that gives me the possibility to:
provide the means of simple SP execution by name
provide automatic materialisation of the returned data, so I could just supply materialisers by means of lambda functions
like:
List<Person> result = Context.Execute("StoredProcName", record => new Person {
    Name = record.GetData<string>("PersonName"),
    UserName = record.GetData<string>("UserName"),
    Age = record.GetData<int>("Age"),
    Gender = record.GetEnum<PersonGender>("Gender")
    ...
});
or even calling a stored procedure that returns multiple result sets, etc.:
List<Question> result = Context.ExecuteMulti("SPMultipleResults", q => new Question {
    Id = q.GetData<int>("QuestionID"),
    Title = q.GetData<string>("Title"),
    Content = q.GetData<string>("Content"),
    Comments = new List<Comment>()
}, c => new Comment {
    Id = c.GetData<int>("CommentID"),
    Content = c.GetData<string>("Content")
});
Basically this last one wouldn't work as written, since the call has no knowledge of how to bind the two result sets together... but you get the point.
So to put it all down to a single question: Is there a DAL library that's optimised for stored procedure execution and data materialisation?
Business Layer Toolkit might be exactly what's needed here. It's a lightweight ORM tool that supports lots of scenarios, including multiple result sets, although those look fairly involved to set up.
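For comparison, the single-result-set flavour of the Context.Execute idea from the question is a fairly thin wrapper over ADO.NET. A hypothetical sketch (Execute, connectionString and the materialiser lambda are invented names; requires the System.Data and System.Data.SqlClient namespaces):

public List<T> Execute<T>(string procName, Func<IDataRecord, T> materialise)
{
    var results = new List<T>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(procName, conn))
    {
        // execute the stored procedure by name
        cmd.CommandType = CommandType.StoredProcedure;
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            // the lambda plays the role of the materialiser from the question
            while (reader.Read())
                results.Add(materialise(reader));
        }
    }
    return results;
}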