Using a MySQL result from Node.js rather than slurping rows

TL;DR: How do I SELECT on a MySQL table with Node.js without slurping the results? (If you say SELECT COUNT(id) then use OFFSET and LIMIT: no, that's silly. I want to HOLD the query result object and poke it afterwards, not hammer the database to death re-SELECTing every few seconds. Some of these queries will take several seconds to run!)
What: I want to run a MySQL query in a Node.js service and basically attach the MYSQL_RES to the client's session context.
Why: The queries in question may yield tens of thousands of results. I'll only have 1 demanding client at a time, and the UI will only display ~30 results at a time in a scrollable/flickable list view.
How: In Qt it's standard practice to use a QSqlQueryModel for circumstances such as this. Basically, if you have a list (view) with this type of model attached, it will only "read" the results pertinent to the visible area. As the view is scrolled down, the model populates it with more results.
But: The Node.js asynchronous approach is lovely, but I have yet to find a way to get ONLY the result object (sans rows), along with a method of picking out arbitrary result rows, or a span of them, on demand.

Related

What is the Optimized way to Paginate Active Record Objects with Filter?

I want to display the users list with pagination in my Rails API. However, I have a few constraints: before displaying the users, I want to check which users have access to the view files. Here is the code:
def verified_client
  conditions = {}
  conditions[:user_name] = fetch_verified_users_with_api_call # returns [user_1, user_2, ...]
  @users = User.where(conditions).where('access NOT LIKE ?', 'admin_%').ordered
  will_paginate(@users, params[:page])
end
Q1) Is there a way to avoid making an SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
Q2) What would happen if the verified_users list returns a million items? I suspect the SQL will fail.
I could have used LIMIT and OFFSET in the query, but then I would not know the total result count or the number of pages; to get those I would have to fire one more SQL call for the count and write my own logic to compute the number of pages.
Generated SQL:
select *
from users
where user_name IN (user_1, user_2 .... user_10000)
AND (access NOT LIKE 'admin_%')
That query is hard to optimize. It probably does essentially all the work for each page, and there is no good way to prevent this scan. Adding these may help:
INDEX(access)
INDEX(user_name, access)
I have seen 70K items in an IN list, but I have not heard of 1M. What is going on? Would it be shorter to say which users are not included? Could there be another table with the user list? (Sometimes a JOIN works better than IN, especially if you have already run a SELECT to get the list.)
Could the admins be filtered out of the IN list before building this query? Then,
INDEX(user_name)
is likely to be quite beneficial.
Is there at most one row per user? If so, then pagination can be revised to be very efficient. This is done by "remembering where you left off" instead of using OFFSET. More: http://mysql.rjweb.org/doc.php/pagination
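A minimal sketch of the "remembering where you left off" idea, written in Python against an in-memory SQLite database so that it is self-contained. The schema, the make_db and fetch_page helpers, and the page size are assumptions made for illustration; the same WHERE user_name > last_seen ... ORDER BY ... LIMIT shape should carry over to MySQL as long as user_name is unique and indexed.

```python
import sqlite3

def make_db(users):
    """Create an in-memory users table from (user_name, access) pairs (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_name TEXT PRIMARY KEY, access TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    return conn

def fetch_page(conn, after="", page_size=3):
    """Return the next page of non-admin user names sorted after `after`.

    Passing the last name of the previous page replaces OFFSET: the index on
    user_name lets the database seek straight to the starting row.
    """
    rows = conn.execute(
        "SELECT user_name FROM users "
        "WHERE user_name > ? AND access NOT LIKE 'admin_%' "
        "ORDER BY user_name LIMIT ?",
        (after, page_size))
    return [name for (name,) in rows]
```

The client only has to send back the last user_name it displayed; an empty page signals the end of the list.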
Q1) Is there a way to avoid making an SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
The whole idea of pagination is that you make the query faster by returning a small subset of the total number of records. In most cases the number of requests for the first page will vastly outnumber those for the other pages, so this could very well be a case of premature optimization that does more harm than good.
If it actually is a problem, it is better addressed with SQL caching, ETags or other caching mechanisms, not by loading a bunch of pages at once.
Q2) What would happen if the verified_users list returns a million items? I suspect the SQL will fail.
Your database or application will very likely slow to a halt and then crash when it runs out of memory. Exactly what happens depends on your architecture and how grumpy your boss is on that given day.
Q1) Is there a way to avoid making an SQL call when users fetch subsequent pages (page 2, page 3, ... page n)?
You can get the whole result set and store it in your app. As far as the database is concerned, this is neither slow nor non-optimal. Performance, including memory use, then becomes your app's problem.
Q2) What would happen if the verified_users list returns a million items? I suspect the SQL will fail.
What will happen is all those entries will be concatenated in the SQL string. There is likely a maximum SQL string size and a million entries would be too much.
A possible solution is if you have a way to identify the verified users in the database and do a join with that table.
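As a rough sketch of the join idea, the verified names can be loaded into a table and joined against users instead of being concatenated into an IN list. This is Python with SQLite for self-containment; the verified_users temporary table and the users_via_join helper are hypothetical names, and an existing users(user_name, access) table is assumed.

```python
import sqlite3

def users_via_join(conn, verified_names):
    """Select non-admin users whose names appear in verified_names, using a join."""
    # Hypothetical staging table; a permanent table kept in sync would also work.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS verified_users (user_name TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM verified_users")
    # Rows are sent as bound parameters, so the SQL text stays short
    # no matter how many names there are.
    conn.executemany(
        "INSERT OR IGNORE INTO verified_users VALUES (?)",
        ((name,) for name in verified_names))
    rows = conn.execute(
        "SELECT u.user_name FROM users AS u "
        "JOIN verified_users AS v ON v.user_name = u.user_name "
        "WHERE u.access NOT LIKE 'admin_%' "
        "ORDER BY u.user_name")
    return [name for (name,) in rows]
```

Because the statement no longer grows with the list, the maximum SQL string size stops being the limiting factor; the cost moves to loading the staging table.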
What is the Optimized way to Paginate Active Record Objects with Filter?
The three things which are not premature optimizations with databases are (1) use indexed queries, not table scans, (2) avoid correlated sub-queries, and (3) reduce network turns.
Make sure you have an index the query can use, in particular for the ordering. So make sure you know what order you are asking for.
If, instead of an access field that starts with a prefix, you had a dedicated field indicating an admin user, you could create an index with that admin field first and the field you are ordering by second. This allows the database to sort the records efficiently, which is especially important when paging with OFFSET and LIMIT.
As for network turns you might want to use paging and not worry about network turns. One idea is to prefetch the next page if possible. So after it gets the results of page 1, query for page 2. Hold the page 2 results until viewed, but when viewed then get the results for page 3.
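The prefetch idea above could look roughly like the following Python sketch. PrefetchingPager and the fetch_page callable are hypothetical; here the prefetch runs synchronously for clarity, whereas a real client would probably issue it in the background.

```python
class PrefetchingPager:
    """Serve pages in order while keeping the following page already loaded."""

    def __init__(self, fetch_page):
        # fetch_page(n) is a hypothetical callable returning the rows of page n (1-based).
        self._fetch_page = fetch_page
        self._next_number = 1
        self._held = fetch_page(1)

    def next_page(self):
        """Return the held page, then immediately fetch the one after it."""
        current = self._held
        self._next_number += 1
        # Stop querying once an empty page has been seen.
        self._held = self._fetch_page(self._next_number) if current else []
        return current
```

When the user views page 1, page 2 is already being held; viewing page 2 triggers the query for page 3, and so on.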

Is getting the table size in JPA an expensive operation?

I implement server-side pagination for table viewing in my web application. This means the user has buttons to activate first-page, last-page, next-page, and prior-page. Each click results in a server request where only the records to be shown are returned.
To implement that "last page" function and a scroll bar I need the client to have the size of the table. I can get this on the server-side with the following method:
public long getCount(Class<?> entityClass) {
    CriteriaBuilder builder = em.getCriteriaBuilder();
    CriteriaQuery<Long> query = builder.createQuery(Long.class);
    // Equivalent to: SELECT COUNT(e) FROM <entity> e
    Root<?> root = query.from(entityClass);
    Expression<Long> count = builder.count(root);
    query.select(count);
    TypedQuery<Long> typedQuery = em.createQuery(query);
    return typedQuery.getSingleResult();
}
This table could be very active with millions of records. Does running this function cause a lot of CPU cycles in the SQL server to be utilized?
The concern is how well this application will scale.
That depends entirely on the database. All JPA implementations I know will translate count to select count(*) from Table. We have a PostgreSQL database with a single table holding 130GB of data, where most rows are only a few kilobytes. Doing select count(*) from table takes minutes; a developer once ran a simple unindexed select query, and the full table scan took about 45 minutes.
When doing pagination, you often have a filter, and it is important to apply the same filter to both the data query and the count query (one of the main reasons for using CriteriaBuilder is to share the filtering part of the query). Today I would recommend using Spring Data, since it makes pagination almost effortless.
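To illustrate sharing the filter between the data query and the count query, here is a small sketch in Python with SQLite rather than JPA, purely to keep it self-contained. The items table, its columns, and the page_with_total helper are assumptions for the example.

```python
import sqlite3

def page_with_total(conn, where_sql, params, page, page_size):
    """Run the data query and the count query with the same WHERE clause.

    where_sql is a trusted SQL fragment that uses ? placeholders; the values
    in params are bound, never interpolated.
    """
    total = conn.execute(
        f"SELECT COUNT(*) FROM items WHERE {where_sql}", params).fetchone()[0]
    rows = conn.execute(
        f"SELECT id, name FROM items WHERE {where_sql} ORDER BY id LIMIT ? OFFSET ?",
        (*params, page_size, (page - 1) * page_size)).fetchall()
    return rows, total
```

Building both statements from one where_sql fragment means the reported total cannot drift from what the pages actually contain.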
If you have a lot of data, you can do what Google does: it says there are 1,340,000,000 results for 'zip', but only allows you to jump 10 pages ahead, and if you follow it to the end you will see that it only actually loads 1000 pages. In other words, they cache an estimated size, but require you to narrow the search to get more precise results.

Effectively fetching large number of tuples from Solr

I am stuck on a rather tricky problem. I am implementing a feature on my website whereby a person gets all the results matching a particular criterion. The matching criterion can be anything; for the sake of simplicity, let's call it 'age'. That means the feature will return all the student names from the database (which holds hundreds of thousands of them), with the student whose age matches the supplied parameter most closely on top.
My approaches:
1- I have a Solr server. Since I need to implement this in a paginated way, I would need to query Solr several times (since my Solr page size is 10) to find the 'near-absolute' matching student in real time. This is computationally very intensive. The problem boils down to efficiently fetching this large number of tuples from Solr.
2- I tried processing it in a batch (and increasing the Solr page size to 100). The data produced this way is not guaranteed to be real-time when somebody uses my feature. Also, to make it optimal, I would need learning algorithms to find out which users are 'most likely' to use my feature today, and then batch process those users first. Please remember that the number of users is so high that I cannot run this batch for 'all' the users every day.
On the one hand, if I want to show results in real time, I have to compromise on performance (hitting Solr multiple times, which is barely feasible); on the other, my result set won't be real-time if I use batch processing, and I can't run it every day for all the users.
Can someone correct my seemingly faulty approaches?
Solr indexing is done on MySQL db contents.
As I understand it, your users are not interested in 100K results. They only want the top-10 (or top-100 or a similar low number) results, where the person's age is closest to a number you supply.
This sounds like a case for Solr function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries. For the age example, that would be something like sort=abs(sub(37,age)) asc, score desc, which would return the persons with age closest to 37 first and prioritize by score in case of ties.
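For illustration, the request parameters for such a function-query sort could be assembled as in the Python sketch below. The closest_age_params helper and the age field name are assumptions; the sort is ascending on the absolute difference so that the smallest distance from the target age comes first.

```python
from urllib.parse import urlencode

def closest_age_params(target_age, rows=10):
    """Build Solr select parameters that sort by distance from target_age."""
    return {
        "q": "*:*",
        # Smallest |target_age - age| first; ties are broken by relevance score.
        "sort": f"abs(sub({target_age},age)) asc, score desc",
        "rows": rows,
    }

# URL-encoded form, ready to append to a /select request URL.
query_string = urlencode(closest_age_params(37))
```

Only the top rows are requested, so Solr does the ranking and a single round trip returns the closest matches.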
I think what you need is Solr cursors (the cursorMark parameter), which let you paginate efficiently through large result sets; see "Solr cursors or deep paging".
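A sketch of how a client might walk a result set with cursors, in Python. The fetch callable is hypothetical (it stands in for whatever sends the parameters to Solr's /select handler and parses the JSON response), and the sort fields age and id are assumptions; what the cursor feature itself requires is that the sort include the uniqueKey field and that iteration stop when nextCursorMark comes back unchanged.

```python
def iterate_with_cursor(fetch, page_size=100):
    """Yield every document by following Solr's cursorMark protocol.

    fetch(params) is a hypothetical callable that queries Solr and returns
    the parsed JSON response as a dict.
    """
    cursor = "*"  # "*" asks Solr for the first page
    while True:
        response = fetch({
            "q": "*:*",
            "rows": page_size,
            "sort": "age asc, id asc",  # must include the uniqueKey field (assumed "id")
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor
```

Because this is a generator, the caller can stop consuming at any point without fetching the remaining pages.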

Node.js Rendering big amount of JSON data from the server

I have a view with a for loop that inserts rows into a table. The table is very big, already consisting of a couple of thousand rows.
When I run it, the server throws an out-of-memory exception.
I would like to add an infinite scrolling feature so I won't have to load all the data at once.
Right now the data is sent with a regular res.render('index.ejs', data) call (data is JSON).
I can figure out the infinite scrolling part, but how do I get the JSON data in chunks from the server?
I am using Node.js with Express, and EJS as the template engine.
I am open to using any framework that will aid me through the process (I was particularly checking out Angular.js).
Thanks
Firstly, there is an angular component for infinite scroll: http://ngmodules.org/modules/ngInfiniteScroll
Then, you have to change your backend query to look something like:
http://my.domain.com/items?before=1392382383&count=50
This essentially tells your server to fetch items created/published/changed/whatever before the given timestamp and return only 50 of them. That means your back-end entities (be they blog entries, products, etc.) need to have some sort of natural ordering in a continuous space (a publication date timestamp is almost continuous). This is very important, because even if you use a timestamp you may end up with extreme heisenbugs where items are rendered twice (definitely, if you use <=), entries are lost (if you use < and two entries on the edges of your result sets share the same timestamp), or the same items are loaded again and again (more than 50 items with the same timestamp). You have to take care of such cases by filtering duplicates.
Your server-side code translates this into a query like (DB2 SQL of course):
SELECT * FROM ITEMS
WHERE PUBLICATION_DATE <= 1392382383
ORDER BY PUBLICATION_DATE DESC
FETCH FIRST 50 ROWS ONLY
When infinite scroll reaches the end of the page and calls your registered callback, you create this $http.get request by taking into account the last item of your already loaded items. For the first query, you can use the current timestamp.
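One way to sidestep the duplicate and lost-entry cases described above, as an alternative to filtering duplicates afterwards, is to page on a (timestamp, id) pair so that ties on the timestamp are broken by a unique column. The sketch below uses Python with SQLite; the id column and the items_before helper are assumptions, while the table and PUBLICATION_DATE column follow the query above.

```python
import sqlite3

def items_before(conn, before_ts, before_id, count=50):
    """Return up to `count` items strictly older than the (timestamp, id) cursor."""
    return conn.execute(
        "SELECT id, publication_date FROM items "
        "WHERE publication_date < ? "
        "   OR (publication_date = ? AND id < ?) "
        "ORDER BY publication_date DESC, id DESC "
        "LIMIT ?",
        (before_ts, before_ts, before_id, count)).fetchall()
```

The client passes the timestamp and id of the last item it already holds; rows sharing that timestamp are neither repeated nor skipped, because the id decides which side of the cursor they fall on.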
Another approach is to simply send the id of the last item, like:
http://my.domain.com/items?after_item=1232&count=50
and let the server decide what to do. I suppose you can use NoSQL storage like Redis to answer this kind of query very fast and without side-effects.
That's the general idea. I hope it helps.

Would using Redis with Rails provide any performance benefit for this specific kind of queries

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users, which are stored in a nested set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
That data is further processed and computed into Results.
Some are computed from user activity plus the result from some other category, etc.
What a user enters isn't always the same as what they see in reports.
Those computations can be very tricky; some categories have very specific formulae.
But the rest is just "give me the sum of all entered values for this category for this user for this week/month/year".
The problem is that those stats also need to be summed for the subset of users under a selected user (so it will basically return the sum of all values for all users under that user, including the user itself).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as up to date as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot do the high-intensive sqls directly... That would kill the server. So I'm computing them only once via background process and frontend just reads the results.
Those sqls are hard to optimize and I'm glad I've moved from this approach... (caching is not an option. See below.)
Current app goes like this:
frontend: when a user enters new data, it is saved to a simple MySQL table, like [user_id, pointer_id, date, value], and there is also an insert into the queue.
backend: then there is a calc_daemon process, which checks the queue every 5 seconds for new "recompute requests". We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies; the simplest case is: when you change week stats, we must recompute month and year stats). It does this recomputation the easy way: we select the data with customized, per-pointer SQL statements generated by their classes.
those computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in this table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced the number of records 5x).
when the frontend needs those results, it does simple sums on the partitioned data, with 2 joins (because of the nested set conditions).
The problem is that those simple SQL queries with sums, group by and join-on-the-subtree can take around 200ms each, just for a few records, and we need to run a lot of them. I think they are optimized as well as they can be, according to EXPLAIN, but they are just too heavy for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do with MySQL, and then perform the sums in Ruby manually, so performance could suffer considerably. I'm not really sure I could write all the queries I have now with Redis. Loading the users in Rails and then doing something like "redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea. But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. it cannot have one entry in Redis with a list of all child ids for some user (something like children_for_user_10: [1,2,3]), because the tree structure changes frequently. That's also the reason why I can't store those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in real time.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried to do it the SOA way, but it failed because one way or another I end up with XXX megabytes of data in Ruby, especially when generating the reports, and the GC just kills it. A side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster, since it is an in-memory database, but can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing a group-by in the recalculation? Is it possible to change the group-bys into code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week, and keep variables for the current and previous values of user and week, as well as variables for the sums.
This assumes the bottleneck is the recalculation; you don't really mention which part is too slow.
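The "order the result set and detect when the group changes" suggestion can be sketched as follows, in Python for brevity. The sum_by_user_and_week function and the (user_id, week, value) row shape are assumptions; the rows are expected to arrive already sorted, for instance from a query ending in ORDER BY user_id, week.

```python
def sum_by_user_and_week(rows):
    """Sum values per (user, week) in one pass over rows sorted by user, then week.

    rows is an iterable of (user_id, week, value) tuples already ordered by
    (user_id, week).
    """
    totals = []
    current_key = None
    running_sum = 0
    for user_id, week, value in rows:
        key = (user_id, week)
        if key != current_key:
            # The group changed: emit the finished group and start a new one.
            if current_key is not None:
                totals.append((*current_key, running_sum))
            current_key = key
            running_sum = 0
        running_sum += value
    if current_key is not None:
        totals.append((*current_key, running_sum))  # flush the final group
    return totals
```

This makes a single pass with constant extra state per group, instead of issuing one grouped query per user inside a loop.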