Is getting the table size in JPA an expensive operation? - mysql

I implement server-side pagination for table viewing in my web application. This means the user has buttons to activate first-page, last-page, next-page, and prior-page. Each click results in a server request where only the records to be shown are returned.
To implement that "last page" function and a scroll bar I need the client to have the size of the table. I can get this on the server-side with the following method:
public <T> long getCount(Class<T> entityClass) {
    CriteriaBuilder builder = em.getCriteriaBuilder();
    CriteriaQuery<Long> query = builder.createQuery(Long.class);
    Root<T> root = query.from(entityClass);
    // Select a row count for the entity instead of the entities themselves.
    query.select(builder.count(root));
    return em.createQuery(query).getSingleResult();
}
This table could be very active with millions of records. Does running this function cause a lot of CPU cycles in the SQL server to be utilized?
The concern is how well this application will scale.

That depends entirely on the database. All JPA implementations I know of translate count to select count(*) from Table. We have a PostgreSQL database with a single table holding 130GB of data, where most rows are only a few kilobytes. Doing select count(*) from table takes minutes; a developer once ran a simple unindexed select query, and the resulting full table scan took about 45 minutes.
When doing pagination, you often have a filter, and it is important to apply the same filter to both the data query and the count query (one of the main reasons for using CriteriaBuilder is to share the filtering part of the query). Today I would recommend using Spring Data, since it makes pagination almost effortless.
If you have a lot of data, you can do what Google does: it says there are 1,340,000,000 results for 'zip', but only lets you jump 10 pages ahead, and if you run it to the end you will see that it only actually loads 1000 pages. In other words, it caches an estimated size but requires you to narrow the search to get more precise results.
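The point about sharing the filter between the data query and the count query can be sketched outside JPA as well. Below is a minimal Python illustration using the standard sqlite3 module; the items table, its columns, and the fetch_page helper are hypothetical names made up for this example, not anything from the question.

```python
import sqlite3

def fetch_page(conn, where_sql, params, page, page_size):
    """Return (total, rows) for one page, reusing one WHERE clause for both queries."""
    # where_sql is developer-written SQL; user-supplied values go only in params.
    total = conn.execute(
        f"SELECT COUNT(*) FROM items WHERE {where_sql}", params
    ).fetchone()[0]
    rows = conn.execute(
        f"SELECT id, name FROM items WHERE {where_sql} "
        "ORDER BY id LIMIT ? OFFSET ?",
        (*params, page_size, page * page_size),
    ).fetchall()
    return total, rows

# Hypothetical sample data: ten rows, every second one active.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO items (name, active) VALUES (?, ?)",
    [(f"item{i}", i % 2) for i in range(10)],
)

total, rows = fetch_page(conn, "active = ?", (1,), page=0, page_size=3)
```

Because both statements are built from the same where_sql and params, the total always describes the same result set the pages are cut from.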

Related

What is the Optimized way to Paginate Active Record Objects with Filter?

I want to display the users list with pagination in my Rails API. However, I have a few constraints: before displaying the users, I want to check which users have access to the view files. Here is the code:
def verified_client
  conditions = {}
  conditions[:user_name] = fetch_verified_users_with_api_call # returns [user_1, user_2, ...]
  @users = User.where(conditions).where('access NOT LIKE ?', 'admin_%').ordered
  will_paginate(@users, params[:page])
end
Q1) Is there a way to avoid making a SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
Q2) What would happen when the verified_users list returns a million items? I suspect the SQL will fail.
I could have used limit and offset with the query, but then I would not know the total result count or the number of pages. To get those, I would have to fire one more SQL call for the count and write my own logic to compute the number of pages.
Generated SQL:
select *
from users
where user_name IN (user_1, user_2 .... user_10000)
AND (access NOT LIKE 'admin_%')
That query is hard to optimize. It probably does essentially all the work for each page and there is no good way to prevent this scan. Adding these may help:
INDEX(access)
INDEX(user_name, access)
I have seen 70K items in an IN list, but I have not heard of 1M. What is going on? Would it be shorter to say which users are not included? Could there be another table with the user list? (Sometimes a JOIN works better than IN, especially if you have already run a Select to get the list.)
Could the admins be filtered out of the IN list before building this query? Then,
INDEX(user_name)
is likely to be quite beneficial.
Is there at most one row per user? If so, then pagination can be revised to be very efficient. This is done by "remembering where you left off" instead of using OFFSET. More: http://mysql.rjweb.org/doc.php/pagination
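"Remembering where you left off" is often called keyset (or seek) pagination. A minimal sketch in Python with the standard sqlite3 module follows; it assumes user_name is unique and is the sort key, and the next_page helper and sample rows are hypothetical.

```python
import sqlite3

def next_page(conn, last_seen, page_size):
    """Fetch the page after last_seen; pass None to get the first page."""
    if last_seen is None:
        sql = "SELECT user_name FROM users ORDER BY user_name LIMIT ?"
        args = (page_size,)
    else:
        # Seek past the last row already shown instead of counting an OFFSET.
        sql = ("SELECT user_name FROM users WHERE user_name > ? "
               "ORDER BY user_name LIMIT ?")
        args = (last_seen, page_size)
    return [row[0] for row in conn.execute(sql, args)]

# Hypothetical sample data: seven users with sortable names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO users VALUES (?)", [(f"user_{i:02d}",) for i in range(7)]
)

page1 = next_page(conn, None, 3)
page2 = next_page(conn, page1[-1], 3)
page3 = next_page(conn, page2[-1], 3)
```

With an index on the sort column, each page starts reading at the remembered key, so later pages do not have to step over all the rows that came before them.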
Q1) Is there a way to avoid making a SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
The whole idea of pagination is that you make the query faster by returning a small subset of the total number of records. In most cases the number of requests for the first page will vastly outnumber the requests for other pages, so this could very well be a case of premature optimization that might do more harm than good.
If it is actually a problem, it's better addressed with SQL caching, ETags or other caching mechanisms, not by loading a bunch of pages at once.
Q2) What would happen when the verified_users list returns a million items? I suspect the SQL will fail.
Your database or application will very likely slow to a halt and then crash when it runs out of memory. Exactly what happens depends on your architecture and how grumpy your boss is on that given day.
Q1) Is there a way to avoid making a SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
You can get the whole result set and store it in your app. As far as the database is concerned this is not slow or non-optimal. Then performance including memory is your app's problem.
Q2) What would happen when the verified_users list returns a million items? I suspect the SQL will fail.
What will happen is that all those entries will be concatenated into the SQL string. There is likely a maximum SQL string size, and a million entries would be too much.
A possible solution, if you have a way to identify the verified users in the database, is to join with that table.
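As a rough illustration of the join idea, the sketch below loads the verified names into a temporary table and joins against it instead of building a huge IN list. It uses Python's sqlite3 module purely for demonstration; the verified_users table and the sample rows are assumptions, and a real setup would more likely keep the verified flag or table in the main database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_name TEXT PRIMARY KEY, access TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("alice", "viewer"),
    ("bob", "admin_full"),
    ("carol", "editor"),
    ("dave", "viewer"),
])

# Hypothetical result of the external verification call.
verified = ["alice", "bob", "dave"]

# Bulk-load the names once rather than concatenating them into the SQL text.
conn.execute("CREATE TEMP TABLE verified_users (user_name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO verified_users VALUES (?)", [(n,) for n in verified])

rows = conn.execute(
    """
    SELECT u.user_name
    FROM users u
    JOIN verified_users v ON v.user_name = u.user_name
    WHERE u.access NOT LIKE 'admin_%'
    ORDER BY u.user_name
    LIMIT ? OFFSET ?
    """,
    (10, 0),
).fetchall()
```

The statement text stays the same size no matter how many verified users there are, which sidesteps any limit on SQL string length.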
What is the Optimized way to Paginate Active Record Objects with Filter?
The three things that are not premature optimizations with databases are: (1) use indexed queries, not table scans, (2) avoid correlated sub-queries, and (3) reduce network turns.
Make sure you have an index the database can use, in particular for the order. So make sure you know what order you are asking for.
If, instead of an access field that starts with a prefix, you had a dedicated field to indicate an admin user, you could create an index with that admin field first and the field you are ordering by second. This allows the database to sort the records efficiently, which is especially important when paging with offset and limit.
As for network turns, you might want to just use paging and not worry about them. One idea is to prefetch the next page if possible. So after it gets the results of page 1, query for page 2. Hold the page 2 results until viewed, and once they are viewed, get the results for page 3.

Cosmos DB : Faster Search Options

We have a huge Cosmos DB container with billions of rows and almost 300 columns. The data is partitioned and modeled for the way we query it most of the time.
For example: the User table is partitioned by userId, which is why the query below works fine.
Select * from User where userId = "user01234"
But in some cases we need to query the data differently, filtering on another property and then sorting.
For example: get data from the User table by user post and date of post.
Select * from user where userPostId = "P01234" orderBy date limit 100
This query takes a lot of time because of the size of the data and because the data is not partitioned for query 2 (user post).
My question is: how can we make query 2 and other similar queries faster when the data is not partitioned accordingly?
Option 1: "Create a separate collection which is partitioned as per query 2."
This will make the query faster, but for any new query we will end up creating a new collection, which means duplicating billions of records. [Costly option]
Option 2: "Build Elasticsearch on top of the DB?" This is a time-consuming option and may be overkill for this slow-query problem.
Is there any other option that can be used? Let me know your thoughts.
Thanks in advance!
Both options are expensive. The key is deciding which is cheaper, including running the cross-partition query. This will require you to cost out each of these options.
For the cross-partition query, capture the RU charge in the response object so you know the cost of it.
For change feed (used to populate a second collection and keep it in sync), there will be an upfront cost as you run it over your existing collection, but whether that cost remains high depends on how much data is inserted or updated each month. Calculating the cost to populate your second collection will take some work. You can start by measuring the RU charge in the response object when doing an insert, then multiply by the number of rows. Calculating how much throughput you'll need will be a function of how quickly you want to populate your second collection. It's also a function of how much compute and how many instances you use to read and write the data to the second collection.
Once the second collection is populated, Change Feed will cost 2 RU/s to poll for changes (btw, this is configurable) and 1 RU/s to read each new item. Inserting data into the second collection costs whatever you measured earlier.
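The backfill arithmetic described above can be written down as a small helper. This is only a sketch: the function name and every number passed to it (10 RU per insert, one billion rows, a one-week window) are made-up placeholders, and you would substitute the RU charge you actually measure.

```python
def backfill_estimate(ru_per_insert, row_count, target_seconds):
    """Estimate the total RU and the sustained RU/s needed to copy row_count items."""
    # Total cost: measured charge for one insert times the number of rows.
    total_ru = ru_per_insert * row_count
    # Throughput needed to finish within the chosen time window.
    required_ru_per_second = total_ru / target_seconds
    return total_ru, required_ru_per_second

# Hypothetical inputs, not measurements.
total_ru, ru_per_second = backfill_estimate(
    ru_per_insert=10.0,
    row_count=1_000_000_000,
    target_seconds=7 * 24 * 3600,
)
```

Halving the time window doubles the throughput you need to provision, which is the trade-off between how fast you populate the second collection and what it costs while you do.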
If this second query doesn't get run that often and your data doesn't change that much, then change feed could save you money. If you run this query a lot and your data changes frequently too, change feed could still save you money.
With regards to Elastic Search or Azure Search, I generally find this can be more expensive than keeping the cross-partition query or change feed. Especially if you're doing it to just answer a second query. Generally this is a better option when you need true free text query capabilities.
A third option you might explore is using Azure Synapse Link and then run both queries using SQL Serverless or Spark.
Some other observations.
Unless you need all 300 properties in these queries you run, you may want to consider shredding these items into separate documents and storing as separate rows. Especially if you have highly asymmetric update patterns where only a small number of properties get frequently updated. This will save you a ton of money on updates because the smaller the item you update, the cheaper (and faster) it will be.
The other thing I would suggest is to look at your index policy and exclude every property that is not used in the where clause for your queries and include properties that are. This will have a dramatic impact on RU consumption for inserts. Also take a look at composite index for your date property as this has a dramatic impact on queries that use order by.
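For illustration, an indexing policy along those lines might look like the following, written as a Python dict that serializes to the JSON shape Cosmos DB uses for indexing policies. The property names (userId, userPostId, date) are taken from the queries in the question; which paths to include, and the sort orders in the composite index, are assumptions to adapt to your own queries.

```python
import json

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    # Index only the properties that appear in WHERE or ORDER BY clauses.
    "includedPaths": [
        {"path": "/userId/?"},
        {"path": "/userPostId/?"},
        {"path": "/date/?"},
    ],
    # Exclude everything else so inserts do not pay to index ~300 properties.
    "excludedPaths": [{"path": "/*"}],
    # Composite index for: filter on userPostId, then order by date.
    "compositeIndexes": [
        [
            {"path": "/userPostId", "order": "ascending"},
            {"path": "/date", "order": "ascending"},
        ]
    ],
}

policy_json = json.dumps(indexing_policy, indent=2)
```

Compare the RU charge of a sample insert and a sample query before and after changing the policy to confirm the effect on your data.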

Using a Mysql result from NodeJS rather than slurping rows

TLDR; How do I SELECT on a Mysql table with Node.js without slurping the results? (If you say SELECT COUNT(id) then use OFFSET and LIMIT, no, that's silly. I want to HOLD the query result object and poke it after, not hammer the database to death re-SELECTing every few seconds. Some of these queries will take several seconds to run!)
What: I want to run a Mysql query on a nodejs service and basically attach the MYSQL_RES to the client's session context.
Why: The queries in question may yield tens of thousands of results. I'll only have 1 demanding client at a time, and the UI will only display ~30 results at a time in a scrollable/flickable list view.
How: In Qt it's standard practice to use a QSqlModel for circumstances such as this. Basically if you have a list (view) with this type of model attached, it will only "read" the results pertinent to the visible area. As the view is scrolled down it populates it with more results.
But: The NodeJS asynchronous method is lovely, but I have yet to find a way to get ONLY the result object (sans rows), and a respective method of "picking out" result rows or a span thereof arbitrarily.

Would using Redis with Rails provide any performance benefit for this specific kind of queries

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to category (we call it Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What user enters isn't always the same what he sees in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
The problem is that those stats also need to be summed for the subset of users under a selected user (so it basically returns the sum of all values for all users under that user, including the user itself).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics, with one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as up to date as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot do the high-intensive sqls directly... That would kill the server. So I'm computing them only once via background process and frontend just reads the results.
Those sqls are hard to optimize and I'm glad I've moved from this approach... (caching is not an option. See below.)
Current app goes like this:
frontend: when user enters new data, it is saved to simple mysql table, like [user_id, pointer_id, date, value] and there is also insert to the queue.
backend: then there is calc_daemon process, which every 5 seconds checks the queue for new "recompute requests". We pop the requests, determine what else needs to be recomputed along with it (pointers have dependencies... simplest case is: when you change week stats, we must recompute month and year stats...). It does this recomputation the easy way.. we select the data by customized per-pointer-different sqls generated by their classes.
those computed results are then written back to mysql, but to partitioned tables (one table per year). One line in this table is like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced 5x # of records).
when frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested set conds).
The problem is that those simple sqls with sums, group by and join-on-the-subtree can take like 200ms each... just for a few records.. and we need to run a lot of these sqls... I think they are optimized the best they can, according to explain... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or other fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I'll rewrite it to use redis, I'll have to run much more queries against it than I have to with mysql, and then perform the sum in ruby manually... so the performance can be hurt considerably... I'm not really sure if I could write all the possible queries I have now with redis... Loading the users in rails and then doing something like "redis, give me sum for users 1,2,3,4,5..." doesn't seem like right idea... But maybe there is some feature in redis that could make this simpler?)...
Also the tree structure needs to be like nested set, i.e. it cannot have one entry in redis with list of all child-ids for some user (something like children_for_user_10: [1,2,3]) because the tree structure changes frequently... That's also the reason why I can't have those sums in those partitioned tables, because when the tree changes, I would have to recompute everything.. That's why I perform those sums realtime.)
Or would you suggest that I rewrite this app in a different language (Java?) and compute the results in memory instead? :) (I've tried to do it the SOA way, but it failed because I end up one way or another with XXX megabytes of data in Ruby... especially when generating the reports... and GC just kills it...) (and a side effect is that one report being generated blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster, it is an in-memory database, but can you fit all of that data in memory? Iterating over redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events), for example it has a fast INCR command.
I'm guessing that you would get sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. C-inline or Go) to do the recalculation. Are you doing group-by in the recalculation? Is it possible to change the group-bys to code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week, and keep variables for the current and previous values of user and week, as well as variables for the sums.
This assumes the bottleneck is the recalculation; you don't really mention which part is too slow.
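The "order the rows, then watch for the group to change" idea can be sketched like this. The example is in Python rather than Ruby, and the row shape (user_id, week, value) and the function name are assumptions chosen for illustration; the rows must already be sorted by user and week, for instance by an ORDER BY in the query.

```python
def sum_by_group(rows):
    """Sum values per (user_id, week) in one pass over rows sorted by that key."""
    results = []
    current_key = None
    running_sum = 0
    for user_id, week, value in rows:
        key = (user_id, week)
        if key != current_key:
            # The group changed: emit the finished group and start a new sum.
            if current_key is not None:
                results.append((*current_key, running_sum))
            current_key = key
            running_sum = 0
        running_sum += value
    # Emit the last group, which has no following row to trigger it.
    if current_key is not None:
        results.append((*current_key, running_sum))
    return results

# Hypothetical rows, already ordered by user_id and week.
rows = [(1, 1, 5), (1, 1, 7), (1, 2, 3), (2, 1, 4), (2, 1, 6)]
totals = sum_by_group(rows)
```

This keeps only the current key and one running sum in memory, so it works on a streamed result set without loading everything first.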

LINQtoSQL caching question

I have been doing a lot of reading but not coming up with any good answers on LinqToSql caching...I guess the best way to ask my question is to just ask it.
I have a jQuery script calling a WCF service based on info that the script is getting from the first two cells of each row of a table. Basically it's looping through the table, calling the service with info from the table cells, and updating the row based on info returned from the service.
The service itself is running a query based on the info from the client basically in the form of:
Dim b = From r In db.batches _
Where r.TotalDeposit = amount _
And r.bDate > startDate AndAlso r.bDate < endDate _
Select r
Using Firebug I noticed that each response was taking anywhere between 125ms and 3 seconds. I did some research, came across an article about caching LINQ objects, and applied it to my project. I was able to return things like the count of the object (b.Count) as a response in a page and noticed that it was caching, so I thought I was cooking with grease... however, when I tried running the above query against the cached object, the times became a consistent 700ms, which is too long.
I read somewhere that LINQ caches automatically so I did the following:
Dim t As New List(Of batch)
Dim cachedBatch = From d In db.batches _
Select d
t = From r In cachedBatch _
Where r.TotalDeposit = amount _
And r.bDate > startDate AndAlso r.bDate < endDate _
Select r
Return t
Now the query runs at a consistent 120-140ms response time...what gives??? I'm assuming its caching since running the query against the db takes a little while (< 35,000 records).
My question I guess then is, should I be trying to cache LINQ objects? Is there a good way to do so if I'm missing the mark?
As usual, thanks!!!
DO NOT USE the code in that linked article. I don't know what that person was smoking, but the code basically reads the entire contents of a table and chucks it in a memory cache. I can't think of a much worse option for a non-trivial table (and 35,000 records is definitely non-trivial).
Linq to SQL does not cache queries. Linq to SQL tracks specific entities retrieved by queries, using their primary keys. What this means is that if you:
1. Query the DataContext for some entities;
2. Change those entities (but don't call SubmitChanges yet);
3. Run another query that retrieves the same entities.
Then the results of (3) will be the same entities you retrieved in (1), with the changes you made in (2). In other words, you get back the existing entities that Linq is already tracking, not the old entities from the database. But it still has to actually execute the query in order to know which entities to load; change tracking is not a performance optimization.
If your database query is taking more than about 100 ms then the problem is almost certainly on the database side. You probably don't have the appropriate indexes on the columns that you are querying on. If you want to cache instead of dealing with the DB perf issue then you need to cache the results of specific queries, which you would do by keying them to the parameters used to create the query. For example (C#):
IEnumerable<Batch> GetBatches(DateTime startDate, DateTime endDate,
    Decimal amount)
{
    // Key the cache entry on every parameter that shapes the query.
    string cacheKey = string.Format("GetBatches-{0}-{1}-{2}",
        startDate, endDate, amount);
    // The cache indexer returns object, so cast; "as" yields null on a miss.
    IEnumerable<Batch> results = Cache[cacheKey] as IEnumerable<Batch>;
    if (results != null)
    {
        return results;
    }
    results = <LINQ QUERY HERE>.ToList();
    Cache.Add(cacheKey, results, ...);
    return results;
}
This is fine as long as the results can't be changed while the item is in the cache, or if you don't care about getting stale results. If this is an issue, then it starts to become a lot more complicated, and I won't get into all of the subtleties here.
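If slightly stale results are acceptable, one common way to bound the staleness is to give each cached query result a time-to-live. The sketch below shows that idea in Python rather than C#; QueryCache, load_batches, and the 60-second TTL are hypothetical, and it does not deal with invalidation when the underlying data changes.

```python
import time

class QueryCache:
    """Cache query results by key, expiring each entry after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # Fresh entry: skip the database entirely.
        value = loader()  # Miss or expired: run the real query.
        self._entries[key] = (now + self._ttl, value)
        return value

calls = []

def load_batches(start, end, amount):
    # Stand-in for the real database query; records each invocation.
    calls.append((start, end, amount))
    return [f"batch {start}..{end} = {amount}"]

cache = QueryCache(ttl_seconds=60)
# The key includes every parameter that changes the query's result.
key = ("GetBatches", "2024-01-01", "2024-01-31", 100)
first = cache.get_or_load(key, lambda: load_batches(*key[1:]))
second = cache.get_or_load(key, lambda: load_batches(*key[1:]))
```

The second call is served from the cache, so the loader runs only once per key within the TTL window.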
The bottom line is, "caching" every single record in a table is not caching at all, it's turning an efficient relational database (SQL Server) into a sloppy, inefficient in-memory database (a generic list in a cache). Don't cache tables, cache queries if you need to, and before you even decide to do that, try to solve the performance issue in the database itself.
For the record I should also note that someone seems to have implemented a form of caching based on the IQueryable<T> itself. I haven't tested this method, and I'm not sure how much easier it would be than the above to use in practice (you still have to specifically choose to use it, it's not automatic), but I'm listing it as a possible alternative.