Efficiently fetching a large number of tuples from Solr - mysql

I am stuck on a rather tricky problem. I am implementing a feature on my website wherein a person gets all the results matching a particular criterion. The matching criterion can be anything; for the sake of simplicity, let's call it 'age'. The feature should return all the student names from the database (which holds hundreds of thousands of them), with the student whose age matches the supplied parameter most closely on top.
My approaches:
1- I have a Solr server. Since I need to implement this in a paginated way, I would need to query Solr several times (my Solr page size is 10) to find the 'near-absolute' best match in real time. This is computationally very intensive. The problem boils down to efficiently fetching this large number of tuples from Solr.
2- I tried processing it in a batch (and increasing the Solr page size to 100). The data received this way is not guaranteed to be real-time when somebody uses my feature. Also, to make it optimal, I would need machine-learning algorithms to figure out which users are 'most likely' to use my feature today, and then batch-process them on priority. Please keep in mind that the number of users is so high that I cannot run this batch for 'all' users every day.
On one hand, if I want to show results in real time, I have to compromise on performance (hitting Solr multiple times, which is barely feasible); on the other hand, my result set wouldn't be real-time if I do batch processing, and I can't run the batch every day for all users anyway.
Can someone correct my seemingly faulty approaches?
Solr indexing is done on MySQL db contents.

As I understand it, your users are not interested in 100K results. They only want the top-10 (or top-100 or a similar low number) results, where the person's age is closest to a number you supply.
This sounds like a case for Solr function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries. For the age example, that would be something like sort=abs(sub(37, age)) asc, score desc, which returns the persons with age closest to 37 first and prioritizes by score in case of ties.
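As a rough illustration of how such a request could be sent from Ruby (the "students" core name, the "age" and "name" fields, and the target age of 37 are assumptions made for this sketch, not something given in the question):

require "net/http"
require "uri"
require "json"

# Ask Solr for the ten students whose age is closest to 37; score breaks ties.
params = {
  "q"    => "*:*",
  "sort" => "abs(sub(37,age)) asc, score desc",
  "rows" => 10,
  "wt"   => "json"
}
uri = URI("http://localhost:8983/solr/students/select")   # core name is an assumption
uri.query = URI.encode_www_form(params)
docs = JSON.parse(Net::HTTP.get(uri))["response"]["docs"]
docs.each { |doc| puts doc["name"] }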

I think what you need is Solr cursors, which will let you paginate efficiently through large result sets: Solr cursors or deep paging
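To make the cursor idea concrete, here is a minimal sketch of a cursorMark loop (the core name and the assumption that "id" is the uniqueKey field are mine); the key points are that the first request passes cursorMark=*, the sort must include the uniqueKey field, and each follow-up request passes back the nextCursorMark from the previous response:

require "net/http"
require "uri"
require "json"

cursor = "*"   # the first request always uses cursorMark=*
loop do
  params = {
    "q"          => "*:*",
    "sort"       => "id asc",      # the sort must include the uniqueKey field (assumed to be "id")
    "rows"       => 100,
    "cursorMark" => cursor,
    "wt"         => "json"
  }
  uri = URI("http://localhost:8983/solr/students/select")
  uri.query = URI.encode_www_form(params)
  data = JSON.parse(Net::HTTP.get(uri))
  data["response"]["docs"].each { |doc| puts doc["id"] }   # placeholder for real processing
  next_cursor = data["nextCursorMark"]
  break if next_cursor == cursor   # the cursor stops changing once the result set is exhausted
  cursor = next_cursor
end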

Related

What is the Optimized way to Paginate Active Record Objects with Filter?

I want to display the users list with pagination in my Rails API. However, I have a constraint: before displaying the users, I want to check which users have access to view the files. Here is the code:
def verified_client
  conditions = {}
  conditions[:user_name] = fetch_verified_users_with_api_call # returns [user_1, user_2, ....]
  @users = User.where(conditions).where('access NOT LIKE ?', 'admin_%').ordered
  will_paginate(@users, params[:page])
end
Q1) Is there a way where I don't have to make a SQL call when users try to fetch subsequent pages (page 2, page 3, ... page n)?
Q2) What would happen when the verified_users list returns millions of items? I suspect the SQL will fail.
I could have used LIMIT and OFFSET with the query, but then I would not know the total result count or the number of pages; to get those I would have to fire one more SQL call for the count and write my own logic to compute the number of pages.
Generated SQL:
select *
from users
where user_name IN (user_1, user_2 .... user_10000)
AND (access NOT LIKE 'admin_%')
That query is hard to optimize. It probably does essentially all the work for each page and there is no good way to prevent this scan. Adding these may help:
INDEX(access)
INDEX(user_name, access)
I have seen 70K items in an IN list, but I have not heard of 1M. What is going on? Would it be shorter to say which users are not included? Could there be another table with the user list? (Sometimes a JOIN works better than IN, especially if you have already run a Select to get the list.)
Could the admins be filtered out of the IN list before building this query? Then,
INDEX(user_name)
is likely to be quite beneficial.
Is there at most one row per user? If so, then pagination can be revised to be very efficient. This is done by "remembering where you left off" instead of using OFFSET. More: http://mysql.rjweb.org/doc.php/pagination
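In the asker's Rails code that idea could look roughly like the sketch below; it reuses the question's names and assumes user_name is unique and indexed, with params[:after] carrying the last user_name the client saw on the previous page:

last_seen = params[:after]   # nil on the first page
scope = User.where(user_name: fetch_verified_users_with_api_call)
            .where("access NOT LIKE ?", "admin_%")
scope = scope.where("user_name > ?", last_seen) if last_seen.present?
@users = scope.order(:user_name).limit(10)   # "next page" = everything after the last seen row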
Q1) Is there a way where I don't have to make a SQL call when users try to fetch subsequent pages (page 2, page 3, ... page n)?
The whole idea of pagination is that you make the query faster by returning a small subset of the total number of records. In most cases the number of requests for the first page will vastly outnumber those for the other pages, so this could very well be a case of premature optimization that does more harm than good.
If it actually is a problem, it's better addressed with SQL caching, ETags or other caching mechanisms - not by loading a bunch of pages at once.
Q2) What would happen when the verified_users list returns millions of items? I suspect the SQL will fail
Your database or application will very likely slow to a halt and then crash when it runs out of memory. Exactly what happens depends on your architecture and how grumpy your boss is on that given day.
Q1) Is there a way where I don't have to make a SQL call when users try to fetch subsequent pages (page 2, page 3, ... page n)?
You can get the whole result set and store it in your app. As far as the database is concerned this is not slow or non-optimal; performance, including memory use, then becomes your app's problem.
Q2) What would happen when the verified_users list returns millions of items? I suspect the SQL will fail
What will happen is that all those entries get concatenated into the SQL string. There is likely a maximum SQL string size, and a million entries would exceed it.
A possible solution, if you have a way to identify the verified users in the database, is to do a join with that table.
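A rough sketch of that join, assuming a hypothetical verified_users(user_name) table that is kept in sync with the external API (the sync itself is left out of the sketch):

@users = User.joins("INNER JOIN verified_users v ON v.user_name = users.user_name")
             .where("users.access NOT LIKE ?", "admin_%")
             .order("users.user_name")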
What is the Optimized way to Paginate Active Record Objects with Filter?
The three things which are not premature optimizations with databases are (1) use indexed queries, not table scans, (2) avoid correlated sub-queries, and (3) reduce network turns.
Make sure you have an index it can use, in particular for the order. So make sure you know what order you are asking for.
If, instead of the access field starting with a prefix, you had a field that indicates an admin user, you could make an index with that admin field first and the field you are ordering by second. This allows the database to sort the records efficiently, which is especially important when paging with OFFSET and LIMIT.
As for network turns, you might want to use paging and not worry about them too much. One idea is to prefetch the next page if possible: after getting the results of page 1, query for page 2, hold the page 2 results until viewed, and when page 2 is viewed, fetch the results for page 3.
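One hedged way to do that prefetching in a Rails app is to serve each page through a short-lived cache and warm the next page after rendering the current one; the cache key, expiry, and page size below are arbitrary choices for the sketch:

def page_of_users(page)
  Rails.cache.fetch(["verified-users-page", page], expires_in: 5.minutes) do
    User.where(user_name: fetch_verified_users_with_api_call)
        .where("access NOT LIKE ?", "admin_%")
        .order(:user_name)
        .paginate(page: page, per_page: 10)
        .to_a
  end
end

# After rendering page n, a background job (not shown) can simply call page_of_users(n + 1)
# so the follow-up request is answered from the cache instead of hitting MySQL again.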

Cosmos DB : Faster Search Options

We have a huge Cosmos DB container with billions of rows and almost 300 columns. The data is partitioned and modeled the way we query it most of the time.
For example: the User table is partitioned by userId, which is why the query below works fine.
Select * from User where userId = "user01234"
But in some cases we need to query the data differently, in a way that requires sorting.
For example: get data from the User table using the user post and the date of the post
Select * from user where userPostId = "P01234" order by date limit 100
This query takes a lot of time because of the size of the data, and the data is not partitioned to suit query 2 (by user post).
My question is: how can we make query 2 and other similar queries faster when the data is not partitioned accordingly?
Option 1: "Create a separate collection which is partitioned as per query 2" -
This will make the query faster, but for any new query we will end up creating a new collection, which means duplicating billions of records. [Costly option]
Option 2: "Build Elasticsearch on top of the DB?" This is a time-consuming option and may be overkill for this slow-query problem.
Is there any other option that can be used? Let me know your thoughts.
Thanks in advance!
Both options are expensive. The key is deciding which is cheaper, including running the cross-partition query. This will require you to cost each of these options out.
For the cross-partition query, capture the RU charge in the response object so you know the cost of it.
For change feed, this will have an upfront cost as you run it over your existing collection, but whether that cost remains high depends on how much data is inserted or updated each month. Calculating the cost to populate your second collection will take some work. You can start by measuring the RU Charge in the response object when doing an insert then multiply by the number of rows. Calculating how much throughput you'll need will be a function of how quickly you want to populate your second collection. It's also a function of how much compute and how many instances you use to read and write the data to the second collection.
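As a back-of-envelope sketch of that calculation (every number below is an assumption chosen only to show the arithmetic, not a measured value):

ru_per_insert = 10.0            # RU charge measured from one insert's response object
rows_to_copy  = 1_000_000_000   # rows to backfill into the second collection
total_ru      = ru_per_insert * rows_to_copy          # 1.0e10 RU for the whole backfill

backfill_days = 30
extra_ru_per_second = total_ru / (backfill_days * 24 * 3600.0)
# => roughly 3,860 RU/s of additional provisioned throughput for a 30-day backfill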
Once the second collection is populated, Change Feed will cost 2 RU/s to poll for changes (btw, this is configurable) and 1 RU/s to read each new item. The cost of inserting data into a second collection costs whatever it is when you measured it earlier.
If this second query doesn't get run that often and your data doesn't change that much, then change feed could save you money. If you run this query a lot and your data changes frequently too, change feed could still save you money.
With regards to Elasticsearch or Azure Search, I generally find this can be more expensive than keeping the cross-partition query or change feed, especially if you're doing it just to answer a second query. Generally this is a better option when you need true free-text query capabilities.
A third option you might explore is using Azure Synapse Link and then run both queries using SQL Serverless or Spark.
Some other observations.
Unless you need all 300 properties in the queries you run, you may want to consider shredding these items into separate documents and storing them as separate rows, especially if you have highly asymmetric update patterns where only a small number of properties get frequently updated. This will save you a ton of money on updates, because the smaller the item you update, the cheaper (and faster) it will be.
The other thing I would suggest is to look at your index policy: exclude every property that is not used in a where clause for your queries and include the properties that are. This will have a dramatic impact on RU consumption for inserts. Also take a look at a composite index for your date property, as this has a dramatic impact on queries that use ORDER BY.
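For illustration only, an indexing policy along those lines might look roughly like this; it is shown as a Ruby hash for consistency with the other sketches, the paths come from the example query, and the exact shape should be checked against the Cosmos DB documentation:

indexing_policy = {
  "indexingMode"  => "consistent",
  "includedPaths" => [
    { "path" => "/userPostId/?" },
    { "path" => "/date/?" }
  ],
  "excludedPaths" => [
    { "path" => "/*" }          # exclude everything not explicitly listed above
  ],
  "compositeIndexes" => [
    [
      { "path" => "/userPostId", "order" => "ascending" },
      { "path" => "/date",       "order" => "descending" }
    ]
  ]
}
# This hash would be serialized to JSON and applied through the portal, CLI, or an SDK.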

How to make a paged Select query and get aggregated results from many shards

In a sharded environment data is split across various machines/shards. I want to know how I can create a query that returns paged results (e.g. 2nd page, 10 results; or 10th page, 20 results)?
I know that it has to do with the primary key. With a single RDBMS it's easy because you have an auto-increment column, so it's easy to get the last 10 items and return paged data.
I work for ScaleBase, which is a maker of a complete scale-out solution, an "automatic sharding machine" if you like; it analyzes the data and SQL stream, splits the data across DB nodes, load-balances reads, and aggregates results at runtime - so you won't have to!
You can see my answer to this thread about auto increment: Sharding and ID generation as instagram
Also, take a look at my post on http://database-scalability.blogspot.com/ about Pinterest, then and now...
Specifically, merging results from several shards into one result is HELL. Many edge cases: GROUP BY, ORDER BY, JOINs, LIMIT, HAVING. I must say that in ScaleBase we support most combinations; it took us ages. True, we need to do it generically, while you can "bend" toward something proprietary... but still...

Would using Redis with Rails provide any performance benefit for this specific kind of query

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are organized in a nested set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
Those data are further processed and computed into Results.
Some are computed from user activity plus results from some other category... etc.
What a user enters isn't always the same as what he sees in reports.
Those computations can be very tricky; some categories have very specific formulae.
But the rest is just "give me the sum of all entered values for this category, for this user, for this week/month/year".
The problem is that those stats also need to be summed for the subset of users under the selected user (so it basically returns the sum of all values for all users under that user, including the user himself).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also getting pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics, one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay real-time, we cannot run the heavy SQL directly... that would kill the server. So I'm computing the results only once via a background process and the frontend just reads the results.
Those SQLs are hard to optimize and I'm glad I've moved away from that approach... (caching is not an option; see below.)
Current app goes like this:
frontend: when a user enters new data, it is saved to a simple MySQL table, like [user_id, pointer_id, date, value], and there is also an insert into a queue.
backend: a calc_daemon process checks the queue every 5 seconds for new "recompute requests". We pop the requests, determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute the month and year stats...). It does this recomputation the easy way: we select the data with customized, per-pointer SQL generated by their classes.
Those computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in such a table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way the tables have ~500k records (I've basically reduced the number of records 5x).
When the frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested-set conditions).
The problem is that those simple SQLs with sums, GROUP BY and the join-on-the-subtree can take around 200ms each... just for a few records... and we need to run a lot of them. I think they are optimized as well as they can be, according to EXPLAIN... but they are simply too heavy.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do against MySQL, and then perform the sums in Ruby manually... so performance could suffer considerably. I'm not really sure I could even express all the queries I have now with Redis. Loading the users in Rails and then asking "Redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. I cannot have one entry in Redis with a list of all child ids for some user (something like children_for_user_10: [1,2,3]), because the tree structure changes frequently... That's also the reason why I can't keep those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in real time.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried to do it SOA-style, but it failed because one way or another I end up with XXX megabytes of data in Ruby... especially when generating the reports... and the GC just kills it...) (And a side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster (it is an in-memory database), but can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
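For the running-sum use case, a minimal sketch with the redis gem could look like this (the key naming scheme is just an assumption; INCRBY only handles integers, so use INCRBYFLOAT if the values are not whole numbers):

require "redis"

redis = Redis.new

# When a user enters a value, bump the pre-aggregated counters in one atomic round trip.
def record_stat(redis, user_id:, pointer_id:, week:, value:)
  redis.multi do |tx|
    tx.incrby("sum:user:#{user_id}:pointer:#{pointer_id}:week:#{week}", value)
    tx.incrby("sum:user:#{user_id}:pointer:#{pointer_id}:total", value)
  end
end

record_stat(redis, user_id: 42, pointer_id: 7, week: "2014-12", value: 5)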
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing GROUP BY in the recalculation? Is it possible to replace the GROUP BYs with code that orders the result set and then manually checks when the "group" changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week and keep variables for the current and previous values of user and week, as well as variables for the sums.
This is assuming the bottleneck is the recalculation; you don't really mention which part is too slow.
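The "order instead of group" idea looks roughly like this sketch; the entries table, its columns, and the save_week_sum helper are assumptions standing in for the app's real schema and write-back step:

# Stream rows ordered by user and week, and flush a sum whenever the (user, week) key changes.
current_key = nil
sum = 0

rows = ActiveRecord::Base.connection.select_all(
  "SELECT user_id, week, value FROM entries ORDER BY user_id, week"
)

rows.each do |row|
  key = [row["user_id"], row["week"]]
  if current_key && key != current_key
    save_week_sum(current_key, sum)   # placeholder for writing the finished sum back
    sum = 0
  end
  current_key = key
  sum += row["value"].to_i
end
save_week_sum(current_key, sum) if current_key   # flush the final group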

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is just a way of showing bytes in human-readable form. If you store them properly, you're at 9.5 MB instead of 22.
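In Ruby that conversion is a one-liner each way; a small sketch (the column you store raw_id in, e.g. BINARY(20), is up to you):

hex_id = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"  # 40 hex characters
raw_id = [hex_id].pack("H*")                         # 20 raw bytes, suitable for a BINARY(20) column
back   = raw_id.unpack1("H*")                        # hex again when you need to display it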
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
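A rough Rails-migration sketch of that single-table option (the table and column names are assumptions matching the description, and the raw SQL types assume MySQL/InnoDB):

class CreateSavedSearchItems < ActiveRecord::Migration[6.1]
  def change
    create_table :saved_search_items, id: false do |t|
      t.column :search_id, "bigint unsigned", null: false
      t.column :item_id,   "binary(20)",      null: false   # raw SHA1 bytes, not hex text
    end
    # The composite primary key keeps each search's rows clustered together.
    execute "ALTER TABLE saved_search_items ADD PRIMARY KEY (search_id, item_id)"
  end
end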
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
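A tiny sketch of that zero-space variant; the SavedSearch model, the current_max_item_id helper, and the numeric Solr id field are all hypothetical:

# At save time, store only the query text and the current id ceiling.
saved = SavedSearch.create!(query: params[:q], max_item_id: current_max_item_id)

# When the dataset is needed again, re-run the query with the ceiling as a filter so records
# added after the save don't leak into the old dataset.
solr_params = {
  "q"  => saved.query,
  "fq" => "id:[* TO #{saved.max_item_id}]"
}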