LINQ to SQL caching question - linq-to-sql

I have been doing a lot of reading but not coming up with any good answers on LINQ to SQL caching... I guess the best way to ask my question is to just ask it.
I have a jQuery script calling a WCF service based on info that the script is getting from the first two cells of each row of a table. Basically it's looping through the table, calling the service with info from the table cells, and updating the row based on info returned from the service.
The service itself is running a query based on the info from the client basically in the form of:
Dim b = From r In db.batches _
        Where r.TotalDeposit = amount _
        And r.bDate > startDate AndAlso r.bDate < endDate _
        Select r
Using Firebug I noticed that each response was taking anywhere from 125 ms to 3 seconds. I did some research and came across an article about caching LINQ objects and applied it to my project. I was able to return things like the count of the object (b.Count) as a Response in a page and noticed that it was caching, so I thought I was cooking with grease... however, when I tried running the above query against the cached object the times became a consistent 700 ms, which is too long.
I read somewhere that LINQ caches automatically so I did the following:
Dim t As New List(Of batch)
Dim cachedBatch = From d In db.batches _
                  Select d
' Materialize the filtered query into the list
t = (From r In cachedBatch _
     Where r.TotalDeposit = amount _
     And r.bDate > startDate AndAlso r.bDate < endDate _
     Select r).ToList()
Return t
Now the query runs at a consistent 120-140 ms response time... what gives??? I'm assuming it's caching, since running the query against the db takes a little while (just under 35,000 records).
My question I guess then is, should I be trying to cache LINQ objects? Is there a good way to do so if I'm missing the mark?
As usual, thanks!!!

DO NOT USE the code in that linked article. I don't know what that person was smoking, but the code basically reads the entire contents of a table and chucks it in a memory cache. I can't think of a much worse option for a non-trivial table (and 35,000 records is definitely non-trivial).
Linq to SQL does not cache queries. Linq to SQL tracks specific entities retrieved by queries, using their primary keys. What this means is that if you:
Query the DataContext for some entities;
Change those entities (but don't call SubmitChanges yet);
Run another query that retrieves the same entities.
Then the results of step 3 will be the same entities you retrieved in step 1, with the changes you made in step 2 - in other words, you get back the existing entities that Linq is already tracking, not the old entities from the database. But it still has to actually execute the query in order to know which entities to load; change tracking is not a performance optimization.
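To make that concrete, here is a rough C# sketch of the identity-map behaviour described above (the data context type and the ID primary key are illustrative, loosely based on the batch entity from the question):
using (var db = new MyDataContext())
{
    // 1. Query some entities
    var batch = db.batches.First(b => b.ID == 42);

    // 2. Change them, but don't call SubmitChanges yet
    batch.TotalDeposit = 100m;   // assuming TotalDeposit is a decimal column

    // 3. Query the same entities again - the query still hits the database,
    //    but the instance you get back is the SAME tracked object,
    //    so it carries the in-memory change from step 2
    var again = db.batches.First(b => b.ID == 42);

    Console.WriteLine(object.ReferenceEquals(batch, again)); // True
    Console.WriteLine(again.TotalDeposit);                   // 100, not the value stored in the DB
}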
If your database query is taking more than about 100 ms then the problem is almost certainly on the database side. You probably don't have the appropriate indexes on the columns that you are querying on. If you want to cache instead of dealing with the DB perf issue then you need to cache the results of specific queries, which you would do by keying them to the parameters used to create the query. For example (C#):
IEnumerable<Batch> GetBatches(DateTime startDate, DateTime endDate,
    Decimal amount)
{
    string cacheKey = string.Format("GetBatches-{0}-{1}-{2}",
        startDate, endDate, amount);

    // The cache indexer returns object, so cast back to the cached list
    var results = (IEnumerable<Batch>)Cache[cacheKey];
    if (results != null)
    {
        return results;
    }

    results = <LINQ QUERY HERE>.ToList();
    Cache.Add(cacheKey, results, ...);
    return results;
}
This is fine as long as the results can't be changed while the item is in the cache, or if you don't care about getting stale results. If this is an issue, then it starts to become a lot more complicated, and I won't get into all of the subtleties here.
The bottom line is, "caching" every single record in a table is not caching at all, it's turning an efficient relational database (SQL Server) into a sloppy, inefficient in-memory database (a generic list in a cache). Don't cache tables, cache queries if you need to, and before you even decide to do that, try to solve the performance issue in the database itself.
For the record I should also note that someone seems to have implemented a form of caching based on the IQueryable<T> itself. I haven't tested this method, and I'm not sure how much easier it would be than the above to use in practice (you still have to specifically choose to use it, it's not automatic), but I'm listing it as a possible alternative.

Related

Is getting the table size in JPA an expensive operation?

I'm implementing server-side pagination for table viewing in my web application. This means the user has buttons for first-page, last-page, next-page, and prior-page. Each click results in a server request where only the records to be shown are returned.
To implement that "last page" function and a scroll bar, I need the client to know the size of the table. I can get this on the server side with the following method:
public long getCount(Class<?> entityClass) {
    CriteriaBuilder builder = em.getCriteriaBuilder();
    CriteriaQuery<Long> query = builder.createQuery(Long.class);
    Root<?> root = query.from(entityClass);
    Expression<Long> count = builder.count(root);
    query.select(count);
    TypedQuery<Long> typedQuery = em.createQuery(query);
    return typedQuery.getSingleResult();
}
This table could be very active, with millions of records. Does running this function use a lot of CPU cycles on the SQL server?
The concern is how well this application will scale.
That depends entirely on the database; all JPA implementations I know of translate count to select count(*) from Table. We have a PostgreSQL database with a single 130 GB table where most rows are only a few kilobytes. Doing select count(*) on that table takes minutes; a developer once ran a simple query against an unindexed column, and the resulting full table scan took about 45 minutes.
When doing pagination you often have a filter, and it is important to apply the same filter to both the data query and the count query (one of the main reasons for using CriteriaBuilder is to share the filtering part of the query). Today I would recommend using Spring Data, since it makes pagination almost effortless.
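As a rough illustration of sharing the filter (this assumes the same injected EntityManager em as in the question's method, and a hypothetical Order entity with a status field; the types come from javax.persistence.criteria):
// Build the filter once per root, so the count query and the data query always agree
private Predicate filter(CriteriaBuilder cb, Root<Order> root) {
    return cb.equal(root.get("status"), "OPEN");
}

public long countOrders() {
    CriteriaBuilder cb = em.getCriteriaBuilder();
    CriteriaQuery<Long> q = cb.createQuery(Long.class);
    Root<Order> root = q.from(Order.class);
    q.select(cb.count(root)).where(filter(cb, root));
    return em.createQuery(q).getSingleResult();
}

public List<Order> pageOfOrders(int pageNumber, int pageSize) {
    CriteriaBuilder cb = em.getCriteriaBuilder();
    CriteriaQuery<Order> q = cb.createQuery(Order.class);
    Root<Order> root = q.from(Order.class);
    q.select(root).where(filter(cb, root));
    return em.createQuery(q)
             .setFirstResult(pageNumber * pageSize)   // offset
             .setMaxResults(pageSize)                 // page size
             .getResultList();
}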
If you have a lot of data, you can do what Google does: it says there are about 1,340,000,000 results for 'zip', but only lets you jump 10 pages ahead, and if you page through to the end you will see that only about 1,000 results are actually loaded. In other words, they cache an estimated size and require you to narrow the search to get more precise results.

How to retrieve large sets of data across multiple tables and avoid looping queries

First, sorry if the question has already been answered; I searched both here and Google and couldn't find my answer. This question has surely been asked before, but it is hidden pretty deep under all the "Just use LEFT JOIN" and "store it in an array" answers.
I need to load a lot of data spread across multiple tables (then insert it into another database engine, but that's not important, I need to optimize my SELECTs).
The table layout looks like this:
Table A with a_id field
Table B with a_id and b_id field
Table C with b_id and c_id field
... (goes another 3-4 levels like this).
I currently access the data this way (pseudo code):
query1 = SELECT ... FROM TableA WHERE something=$something
foreach query1 as result1:
    query2 = SELECT ... FROM TableB WHERE a_id=result1.a_id
    foreach query2 as result2:
        query3 = SELECT ... FROM TableC WHERE b_id=result2.b_id
        foreach query3 as result3:
            // Another few levels of this, see the millions of SELECTs coming?
The only solutions I have found so far are:
Use the slow way and send multiple queries (current solution, and it takes ages to complete my small test set)
Use a ton of LEFT JOINs to get all the data in one query. That involves transmitting a ton of duplicated data thousands of times and some fancy logic on the client side to split it back into the appropriate tables, since each row will contain the content of its parent tables. (I use OOP; each table maps to an object, and the objects contain each other.)
Store each object from Table A in an array, then load all of Table B into an array, and continue with Table C. Works for small sets, but mine is a few GB large and won't fit into RAM at all.
Is there a way to avoid doing 10k queries per second in such a loop?
(I'm using PHP, converting from MySQL to MongoDB, which handles nested objects like these much better, if that helps.)
EDIT: There seems to be some confusion about what I'm trying to do and why. I will try to explain better: I need to do a batch conversion to a new structure. The new structure works very well, don't even bother looking at that. I'm remaking a very old website from scratch, and chose MongoDB as my storage engine because we have loads of nested data like this and it works very well for me. Switching back to MySQL is not even an option for me; the new structure and code are already well established and I've been working on this for about a year now. I am not looking for a way to optimize the current schema - I can't. The data is that way, and I need to read the whole database. Once. Then I'm done with it.
All I need to do is import the data from the old website, process it and convert it so I can insert it into our new website. Here comes MySQL: the older site was a very normal PHP/MySQL site. We have a lot of tables (about 70, actually). We don't have many users, but each user has a ton of data spread across 7 tables.
What I currently do is loop over each user (1 query). For each of these users (70k), I load Table A, which contains 10-80 rows per user. I then query Table B on every loop of A (so, 10-80 times 70k); it contains 1-16 rows for each A. Then comes Table C, which holds 1-4 rows for each B. We are now at 4*80*70k queries to do. Then I have D, with 1-32 rows for each C; E with 1-16 rows for each D; F with 1-16 rows for each E. Table F has a couple of million rows.
Problem is
I end up doing thousands if not millions of queries to the MySQL server, and the production database is not even on my local machine but 5-10 ms away. Even at a few milliseconds per round trip, that adds up to hours of pure network latency. I created a local replica so my restricted test set runs quite a bit faster, but it's still going to take a long while to download a few GB of data like this.
I could keep the members table in RAM, and maybe Table A, so I can download each database in one shot instead of doing thousands of queries, but from Table B onwards it would be a real mess to track this in memory, especially since I use PHP (on the command line, at least), which uses a bit more memory than a C++ program where I could have tight control over RAM. So this solution doesn't work either.
I could JOIN all the tables together, but while that works for 2-3 tables, doing it for 7 tables would waste a huge amount of bandwidth by transferring the same parent data from the server millions of times for no benefit (while also making the code really complicated when splitting it back into the appropriate objects).
The question is: is there a way to not query the database so often? Like telling the MySQL server, via a procedure or something, that I will need all these datasets in this order, so I don't have to re-do a query for each row and the database just continually spits out data for me? The current problem is that I do so many queries that both the script AND the database are almost idle because one is always waiting for the other. The queries themselves are actually very fast, basic prepared SELECT queries on indexed int fields.
This is a problem I have always gotten myself into with MySQL in the past, and it never really caused me trouble until now. In its current state, the script takes several hours if not days to complete. It's not THAT bad, but if there's a way I can do better I'd appreciate knowing about it. If not, then okay, I'll just wait for it to finish; at worst it will run 3-4 times (2-3 test runs, have users check their data is converted correctly, fix bugs, try again, and the final run with the last bugfixes).
Thanks in advance!
Assuming your 7 tables are linked by ids, do something like this:
First query
'SELECT * FROM table_a WHERE a_id IN (12,233,4545,67676,898999)'
// store the result in $result_of_first_query
Then do a foreach and collect the ids you want to use in the next query into a comma-separated string (CSV).
$csv_for_second_query = "";
foreach ($result_of_first_query as $a_row_from_first_table)
{
    $csv_for_second_query .= $a_row_from_first_table['a_id'] . ",";
}
$csv_for_second_query = trim($csv_for_second_query, ", "); // problem is we will have a lot of duplicate entries
$temp_arr = explode(",", $csv_for_second_query);           // explode the values into an array
$temp_arr = array_unique($temp_arr);                       // remove the duplicates
$csv_for_second_query = implode(",", $temp_arr);           // create the csv string again. ready!
Now, for your second table, you will get with only 1 query all the values you need to JOIN (not in MySQL - we will do the join in PHP).
Second query
'SELECT * FROM table_b where a_id IN ('.$csv_for_second_query.')'
// store the result in $result_of_second_query;
Then we just need to programmatically join the two arrays.
$result_a_and_b = array(); // we will store the joined result of every row here

// scan every row from the first table
foreach ($result_of_first_query as $inc => $a_row_from_first_table)
{
    // assign every row from the first table to $result_a_and_b
    $result_a_and_b[$inc]['a'] = $a_row_from_first_table;

    $inc_b = 0; // counter for the joins made with data from the second table

    // for every row from the first table we scan every row from the second table,
    // so we need this nested foreach
    foreach ($result_of_second_query as $a_row_from_second_table)
    {
        // does this data need to be joined? if yes, then do so! :)
        if ($a_row_from_first_table['a_id'] == $a_row_from_second_table['a_id'])
        {
            $result_a_and_b[$inc]['b'][$inc_b] = $a_row_from_second_table; // "join" in our "own" way :)
            ++$inc_b; // needed for the next join
        }
    }
}
now we have the array $result_a_and_b with this format:
$result_a_and_b[INDEX]['a']
$result_a_and_b[INDEX]['b'][INDEX]
So with just 2 queries we get a result that would otherwise have taken TABLE_A_ROWS_NUMBER + 1 queries (one initial query on the first table, plus one query per row of it).
Keep doing this for as many levels as you want:
Query the database with the id that links the tables
Collect the ids into a CSV string
Query the next table using WHERE id IN (11,22,33,44,55,...)
Join programmatically
Tip: You can use unset() to free up memory on temp variables.
I believe this answers your question "Is there a way to not query the database so often?"
Note: the code has not been tested for typos; maybe I missed a comma or two - or maybe not.
I believe you get the point :) Hope it helps!
Thanks everyone for the answers. I came to the conclusion that I can't actually do it any other way.
My own solution is to set up a replica database (or just a copy, if a snapshot is enough) on localhost. That cuts out the network latency and allows both the script and the database to reach 100% CPU usage, and it seems to be the fastest I can get without reorganizing my script entirely.
Of course, this only works for one-time scripts. The correct way to handle this would be a mix of the two answers I got so far: use multiple unbuffered connections in threads, and process by batch (load 50 rows from Table A, store them in RAM, load all the data related to those rows from Table B, store it in RAM, then process all of that and continue from Table A).
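For what it's worth, a rough sketch of that batched idea in PHP/PDO (the connection details, the chunk size of 50 and the variable names are all made up; the table and column names follow the layout above):
$pdo = new PDO('mysql:host=localhost;dbname=old_site', 'user', 'pass');

// 1. Load one batch of Table A rows.
$aRows = $pdo->query('SELECT * FROM TableA LIMIT 0, 50')->fetchAll(PDO::FETCH_ASSOC);

// 2. Fetch every Table B row related to that batch in a single query.
$aIds = array_values(array_unique(array_column($aRows, 'a_id')));
if (count($aIds) > 0) {
    $placeholders = implode(',', array_fill(0, count($aIds), '?'));
    $stmt = $pdo->prepare("SELECT * FROM TableB WHERE a_id IN ($placeholders)");
    $stmt->execute($aIds);
    $bRows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // 3. Group the B rows by a_id so they can be attached to their parent A rows in RAM.
    $bByParent = array();
    foreach ($bRows as $b) {
        $bByParent[$b['a_id']][] = $b;
    }
}

// ...repeat the same pattern for Table C against the collected b_ids,
// process/convert the batch, then move on to the next 50 rows of Table A.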
Thanks anyway for the answers all!

Using DataContext cache to preload child collection

In an attempt to reduce round-trips to the database, I was hoping to 'preload' child collections for a parent object. My hope was that if I loaded the objects that make up the child collection into the DataContext cache, Linq2SQL would use those objects instead of going to the database.
For example, assume I have a Person object with two child collections: Children and Cars.
I thought this might work:
var children = from p in dbc.Person
               select p.Children;
var cars = from p in dbc.Person
           select p.Cars;
var people = from p in dbc.Person
             select p;

var dummy1 = children.ToList();
var dummy2 = cars.ToList();

foreach (var person in people)
{
    Debug.WriteLine(person.Children.Count);
}
But instead, I'm still getting one trip to the database for every call to person.Children.Count, even though all the children are already loaded in the DataContext.
Ultimately, what I'm looking for is a way to load a full object graph (with multiple child collections) in as few trips to the database as possible.
I'm aware of the DataLoadOptions class (and LoadWith), but it's limited to one child collection, and that's a problem for me.
I wouldn't mind using a stored procedure if I was still able to build up the full object graph (and still have a Linq2SQL object, of course) without a lot of extra manipulation of the objects.
I don't think what you require is directly possible with LINQ to SQL, or even with SQL in the way you're expecting.
When you consider how SQL works, a single one-to-many relationship can easily be flattened with an inner/left join. After that, if you want to include another set of objects, you could theoretically write another left join to bring back all the rows in that table too. But imagine the output SQL would produce: it is not easily workable back into an object graph. ORMs will typically fall back to querying per row to provide this data. If you write multiple levels of DataLoadOptions in LINQ to SQL, you will start to see this happen.
Also, the impact on performance with many LEFT JOINs will easily outweigh the perceived benefits of single querying.
Consider your example. You are fetching ALL of the rows back from these two tables. This poses two problems further down the line:
The tables may end up with much more data in them than you expect. Pulling back thousands of rows from SQL all at once may not be good for performance. You would then expect LINQ to SQL to search those lists for matching objects; depending on usage, I imagine SQL can provide this data faster.
The behavior you expect is hidden, and to me it would be surprising if running a SELECT on a large table could bloat the app's memory usage by caching all of the rows. Maybe it could be offered as an option, or you could provide your own extension methods.
Solution
I would personally start by caching the data outside of LINQ to SQL, and then retrieving objects from the ASP.NET cache / memory cache / your cache provider of choice, if building the full object graph is deemed too expensive.
One level of DataLoadOptions plus relying on the built-in entity relationships + some manual caching will probably save many a headache.
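As a rough sketch of what that looks like (the context type, the cache key and the choice of Children over Cars are illustrative; the point is one eagerly loaded collection plus a manually cached graph):
var options = new DataLoadOptions();
options.LoadWith<Person>(p => p.Children);   // one level of eager loading

using (var dbc = new MyDataContext())
{
    dbc.LoadOptions = options;               // must be set before the first query

    // Children are loaded eagerly with this query instead of one lazy query per person;
    // Cars would still lazy-load on access.
    var people = dbc.Person.ToList();

    // If building the graph is still too expensive, cache the finished list.
    HttpRuntime.Cache.Insert("people-graph", people);
}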
I have come across this particular problem a few times and haven't thought of anything else yet.

"Min in Select statement" or DMin(). Which one is preferable?

I needed to find the minimum revenue in a table tbl_Revenue. I found two methods to do that:
Method 1
Dim MinRevenueSQL As String
Dim rsMinRev As DAO.Recordset
MinRevenueSQL = "SELECT Min(tbl_Revenue.Revenue_Value) As MinRevenue FROM tbl_Revenue WHERE (((tbl_Revenue.Division_ID)=20) AND ((tbl_Revenue.Period_Type)='Annual'));"
Set rsMinRev = CurrentDb.OpenRecordset(MinRevenueSQL)
MinRev = rsMinRev!MinRevenue
Method 2
MinRev2 = DMin("Revenue_Value", "tbl_Revenue", "(((tbl_Revenue.Division_ID)=20) AND ((tbl_Revenue.Period_Type)='Annual'))")
I have the following questions:
Which one of them is computationally more efficient? Is there a big difference in computational efficiency if, instead of the tbl_Revenue table, there is a select statement using joins?
Is there a problem with the accuracy of the DMin function? (By accuracy I mean: are there any loopholes I need to be aware of before using DMin?)
I suspect that the answer may vary depending on your situation.
In a single-user situation, @transistor1's testing method will give you a good answer for an isolated lookup.
But on a db that's shared over a network, IF you have already done Set db = CurrentDb, then the SELECT method should be faster, since it does not require opening a second connection to the db, which is slow.
For the same reason, it is more efficient to Set db = CurrentDb once and reuse that db everywhere.
In situations where I want to make sure I have the best speed, I declare Public db As DAO.Database when opening the app. Then in every module where it is required, I use
If db Is Nothing Then Set db = CurrentDb
In your specific code, you are running it once so it doesn't make much of a difference. If it's in a loop or a query and you are combining hundreds or thousands of iterations, then you will run into issues.
If performance over thousands of iterations is important to you, I would write something like the following:
Sub runDMin()
    x = Timer
    For i = 1 To 10000
        MinRev2 = DMin("Revenue_Value", "tbl_Revenue", "(((tbl_Revenue.Division_ID)=20) AND ((tbl_Revenue.Period_Type)='Annual'))")
    Next
    Debug.Print "Total runtime seconds: " & Timer - x
End Sub
Then implement the same for the DAO query, replacing the MinRev2 part. Run them both several times and take an average. Try your best to simulate the conditions it will be run under; for example if you will be changing the parameters within each query, do the same, because that will most likely have an effect on the performance of both methods. I have done something similar with DAO and ADO in Access and was surprised to find out that under my conditions, DAO was running faster (this was a few years ago, so perhaps things have changed since then).
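For comparison, the DAO version of that loop might look something like this (an untested sketch; it reuses the same SQL as Method 1 and opens the connection once, outside the loop):
Sub runDAOMin()
    Dim x As Single
    Dim i As Long
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim MinRev As Variant

    Set db = CurrentDb   ' open the connection once, outside the loop
    x = Timer
    For i = 1 To 10000
        Set rs = db.OpenRecordset( _
            "SELECT Min(tbl_Revenue.Revenue_Value) As MinRevenue FROM tbl_Revenue " & _
            "WHERE (((tbl_Revenue.Division_ID)=20) AND ((tbl_Revenue.Period_Type)='Annual'));")
        MinRev = rs!MinRevenue
        rs.Close
    Next
    Debug.Print "Total runtime seconds: " & Timer - x
End Sub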
There is definitely a difference when it comes to using DMin in a query to get a minimum from a foreign table. From the Access docs:
Tip: Although you can use the DMin function to find the minimum value from a field in a foreign table, it may be more efficient to create a query that contains the fields that you need from both tables, and base your form or report on that query.
However, this is slightly different than your situation, in which you are running both from a VBA method.
I have tended to believe (maybe erroneously because I don't have any evidence) that the domain functions (DMin, DMax, etc.) are slower than using SQL. Perhaps if you run the code above you could let us know how it turns out.
If you write the DMin call correctly, there are no accuracy issues that I am aware of. Have you heard that there were? Essentially, the call should be: DMin("<Field Name>", "<Table Name>", "<Where Clause>")
Good luck!

Would using Redis with Rails provide any performance benefit for this specific kind of query

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users, which are organized in a nested set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
Those data are further processed and computed into Results.
Some are computed from user activity plus the result of some other category... etc.
What a user enters isn't always the same as what he sees in reports.
Those computations can be very tricky; some categories have very specific formulae.
But the rest is just "give me the sum of all entered values for this category, for this user, for this week/month/year".
The problem is that those stats also need to be summed for the subtree of users under a selected user (so it basically returns the sum of all values for all users under that user, including the user himself).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also getting pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot run the heavy SQL directly... that would kill the server. So I compute the results only once via a background process and the frontend just reads them.
Those SQL queries are hard to optimize and I'm glad I've moved away from this approach... (Caching is not an option; see below.)
Current app goes like this:
frontend: when a user enters new data, it is saved to a simple MySQL table, like [user_id, pointer_id, date, value], and there is also an insert into the queue.
backend: there is a calc_daemon process which checks the queue for new "recompute requests" every 5 seconds. We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute the month and year stats...). It does this recomputation the easy way: we select the data with customized per-pointer SQL generated by their classes.
Those computed results are then written back to MySQL, but into partitioned tables (one table per year). One line in such a table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way the tables have ~500k records (I've basically reduced the number of records 5x).
When the frontend needs those results, it does simple sums on the partitioned data, with 2 joins (because of the nested set conditions).
The problem is that those simple SQL queries with sums, GROUP BY and a join on the subtree can take around 200 ms each... just for a few records... and we need to run a lot of them... I think they are optimized as well as they can be, according to EXPLAIN... but they are just too heavy for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it while I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis I'll have to run many more queries against it than I do against MySQL, and then perform the sums in Ruby manually... so performance could be hurt considerably... I'm not really sure I could express all the queries I have now with Redis... Loading the users in Rails and then asking "Redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. I cannot keep one entry in Redis with a list of all child ids for some user (something like children_for_user_10: [1,2,3]) because the tree structure changes frequently... That's also the reason why I can't store those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in real time.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried doing it SOA-style, but it failed because one way or another I end up with XXX megabytes of data in Ruby... especially when generating the reports... and GC just kills it...) (And a side effect is that generating one report blocks the whole Rails app :/)
Suggestions are welcome.
Redis would be faster; it is an in-memory database. But can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
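A rough sketch of that pattern with the redis-rb gem (the key naming scheme is made up, and INCRBY only handles integer values; there is INCRBYFLOAT for decimals):
require "redis"

redis = Redis.new

# When a user enters a value, bump the pre-aggregated counter instead of
# recomputing the whole sum later.
def record_stat(redis, user_id, pointer_id, week, value)
  redis.incrby("sum:#{user_id}:#{pointer_id}:w#{week}", value)
end

# Reading a report is then a handful of GET calls rather than a SQL aggregate.
def week_sum(redis, user_id, pointer_id, week)
  redis.get("sum:#{user_id}:#{pointer_id}:w#{week}").to_i
end

record_stat(redis, 10, 3, 42, 250)
puts week_sum(redis, 10, 3, 42)  # => 250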
I'm guessing that you would get a sufficient speed improvement by using a stored procedure, or a faster language than Ruby (e.g. inline C or Go), to do the recalculation. Are you doing GROUP BY in the recalculation? Is it possible to change the group-bys into code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week and keep variables for the current and previous values of user and week, as well as variables for the sums.
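In plain Ruby that "ordered result set instead of GROUP BY" idea looks roughly like this (rows stands for a result set already ordered by user_id and week; the field names are illustrative):
def sums_by_user_and_week(rows)
  sums = {}
  current_key = nil
  running_sum = 0

  rows.each do |row|
    key = [row[:user_id], row[:week]]
    if key != current_key
      sums[current_key] = running_sum unless current_key.nil?  # the "group" changed: flush it
      current_key = key
      running_sum = 0
    end
    running_sum += row[:value]
  end
  sums[current_key] = running_sum unless current_key.nil?      # flush the last group
  sums
end

rows = [
  { user_id: 1, week: 1, value: 10 },
  { user_id: 1, week: 1, value: 5 },
  { user_id: 1, week: 2, value: 7 },
  { user_id: 2, week: 1, value: 3 },
]
p sums_by_user_and_week(rows)  # => {[1, 1]=>15, [1, 2]=>7, [2, 1]=>3}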
This assumes the bottleneck is the recalculation; you don't really mention which part is too slow.