Dynamic data in SOLR (like counts, view counts, etc.) - NRT vs Redis?

We are creating a web application which would contain posts (something like Facebook or YouTube). For the stable part of the data (i.e. the facets, search results and their content), we plan to use SOLR.
What should we use for the unstable part of the data, i.e. dynamic and volatile content such as like counts, comment counts and view counts?
Option 1) Redis
What about storing the "dynamic" data in a different data store (like Redis)? That way, every time the counts get refreshed, I do not have to reindex the data into SOLR at all. SOLR indexing would only be triggered when new posts are added to the site, and never by user activity on existing posts.
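For illustration, the kind of read path I have in mind is roughly the following (the core name, Redis key scheme, field names and client libraries are just placeholders, not a finished design):

```python
# Rough sketch only: stable data comes from Solr, volatile counters from Redis.
# The "posts" core, "post:<id>:likes" keys, pysolr and redis-py are assumptions.
import pysolr
import redis

solr = pysolr.Solr("http://localhost:8983/solr/posts", timeout=10)
r = redis.Redis(decode_responses=True)

def search_with_counts(query, rows=10):
    docs = list(solr.search(query, rows=rows))        # facets/search results from the index
    keys = [f"post:{doc['id']}:likes" for doc in docs]
    counts = r.mget(keys) if keys else []             # one round trip for all counters
    for doc, count in zip(docs, counts):
        doc["like_count"] = int(count or 0)           # merge the volatile value into the doc
    return docs
```

Likes would then just be INCR/INCRBY calls against Redis, and Solr would only be touched when a post is created or its stable fields change.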
Side-note :-
I also looked at the SOLR-Redis plugin at https://github.com/sematext/solr-redis
The plugin looks good, but I'm not sure whether it can be used to fetch the data stored in Redis as part of the Solr result set, i.e. in the returned docs. The description reads more like the Redis data can be used in function queries for boosting, sorting, etc. Does anyone have experience with this?
Option 2) SOLR NRT with Soft Commits
We would depend on the built-in NRT features. Let's say we do soft commits every second and hard commits every 10 seconds. Suppose a huge amount of dynamic data is created on the site across thousands of posts, e.g. 100,000 likes across 10,000 posts. That would mean soft-committing changes to 10,000 documents every second, and then hard-committing that many every 10 seconds. Isn't this overkill?
Which option is preferred? How would you compare the two in terms of scalability, maintenance, feasibility, best practices, etc.? Any real-life experiences or links to articles?
Many thanks!
p.s. EFF (external file fields) is not an option, as I read that the data in that file can only be used in function queries and cannot be returned as part of a document.

I would advise you to go with Redis for data that changes frequently. One thing to keep in mind about Solr soft commits is that they invalidate some cache data; if you have a lot of it, opening a new searcher and rebuilding the caches may be more time-consuming than you'd like.
Solr is great for full-text search and for working through data that requires tokenization. It's also pretty quick; however, I don't think it is the right tool for this job.
You can also check out this blog post for more info on Solr commits.
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
Per the post:
Soft commits are about visibility, hard commits are about durability. The thing to understand most about soft commits are that they will make documents visible, but at some cost. In particular the “top level” caches, which include what you configure in solrconfig.xml (filterCache, queryResultCache, etc) will be invalidated! Autowarming will be performed on your top level caches (e.g. filterCache, queryResultCache), and any newSearcher queries will be executed. Also, the FieldValueCache is invalidated, so facet queries will have to wait until the cache is refreshed. With very frequent soft commits it’s often the case that your top-level caches are little used and may, in some cases, be eliminated. However, “segment level caches”, which include function queries, sorting caches, etc are “per segment”, so will not be invalidated on soft commit.

Redis => haven't explored this option
SOLR NRT with Soft Commits => It's overkill and inefficient, since the complete document gets rewritten even though only part of it changes each time. The more efficient way to handle this is to keep these dynamic fields (like count, view count, etc.) outside the inverted index. There are two ways to handle this.
A. Using EFF (external file fields). In the post you mentioned that EFF is not an option because the data in that file can only be used in function queries and cannot be returned as part of a document. If I'm not wrong, you want each dynamic field's value in the search response; we can get this by using field(external_field_name) in the fl parameter.
B. Using docValues. DocValues fields are column-oriented fields with a document-to-value mapping built at index time; they are stored alongside the inverted index rather than inside it. We can define these fields as docValues and use the partial update feature to update just these fields.
DocValue => https://solr.apache.org/guide/8_0/docvalues.html
EFF => https://solr.apache.org/guide/8_0/working-with-external-files-and-processes.html
Document Update => https://solr.apache.org/guide/6_6/updating-parts-of-documents.html
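To make option B (and the fl trick from option A) concrete, here is a hedged sketch over plain HTTP. The core name "posts" and the field names like_count / eff_like_count are invented; note that true in-place updates additionally require the field to be a single-valued, non-indexed, non-stored docValues field:

```python
# Sketch: bump a counter with a partial update, then return it (plus an EFF value)
# as part of the result set. Field and core names are assumptions.
import json
import requests

SOLR = "http://localhost:8983/solr/posts"

# Partial update: only like_count is sent. With a single-valued, non-indexed,
# non-stored docValues field, Solr can apply this in place instead of
# re-indexing the whole document. Visibility still depends on your (soft)
# commit settings.
requests.post(
    f"{SOLR}/update",
    headers={"Content-Type": "application/json"},
    data=json.dumps([{"id": "post-42", "like_count": {"inc": 1}}]),
)

# Read the counter back in the response; field(...) in fl is also how an
# external file field's value can be returned per document.
resp = requests.get(
    f"{SOLR}/select",
    params={"q": "id:post-42", "fl": "id,like_count,field(eff_like_count)"},
)
print(resp.json()["response"]["docs"])
```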

Related

How should something like SO's vote count be stored in a database?

I'm assuming votes on Stack Overflow are relations between users and posts. It would be expensive to count the votes for each page load, so I'm assuming it's cached somewhere. Is there a best practice for storing values that can be computed from other DB data?
I could store it in something like Redis, but then it'll be expensive to sort questions by votes.
I could store it as a new column in the posts table, but it'll be confusing to other engineers because derived values aren't typically stored with actual data.
I could create an entity-attribute-value table just for derived data, so I could join it with the posts table. There's a slight performance hit for the join and I don't like the idea of a table filled with unstructured data, since it would easily end up being filled with unused data.
I'm using MySQL 8, are there other options?
One more consideration is that this data doesn't need to be consistent; it's OK if the vote total is off slightly. So when a vote is created, the total doesn't need to be updated immediately; a job can run periodically to update it.
"Best practice" is very much situational, and often based on opinion. Here's how I look at it.
Your question seems to be about how to make a database-driven application perform at scale, and what trade-offs are acceptable.
I'd start by sticking to the relational, normalized data model for as long as you can. You say "It would be expensive to count the votes for each page load" - probably not that expensive, because you'll be joining on foreign keys, and unless you're talking about very large numbers of records and/or requests, that should scale pretty well.
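For example (table and column names are assumed, not from the question), the per-question count is a single indexed aggregate, and MySQL only has to walk the votes.post_id index for that question:

```python
# Sketch: count votes on demand with a join on the foreign key.
# The schema (posts, votes with an index on votes.post_id) is an assumption.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="appdb")
cur = conn.cursor()

question_id = 42
cur.execute(
    """
    SELECT p.id, p.title, COUNT(v.id) AS vote_count
    FROM posts p
    LEFT JOIN votes v ON v.post_id = p.id
    WHERE p.id = %s
    GROUP BY p.id, p.title
    """,
    (question_id,),
)
print(cur.fetchone())
```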
If scalability and performance are challenges, I'd build a test rig, and optimize those queries, subject them to load and performance testing and add hardware capacity before doing anything else.
This is because normalized databases and applications without duplication/caching are easier to maintain, less likely to develop weird bugs, and easier to extend in future.
If you reach the point where that doesn't work anymore, I'd look at caching. There are a range of options here - you mention three. The challenge is that once the normalized database becomes a performance bottleneck, there are usually lots of potential queries which become the bottleneck - if you optimize the "how many votes does a post get?" query, you move the problem to the "how many people have viewed this post?" query.
So, at this point I typically try to limit the requests to the database by caching in the application layer. This can take the form of a Redis cache. In descending order of effectiveness, you can:
Cache entire pages. This reduces the number of database hits dramatically, but is hard to do with a personalized site like SO.
Cache page fragments, e.g. the SO homepage has a few dozen questions; you could cache each question as a snippet of HTML, and assemble those snippets to render the page. This allows you to create a personalized page, by assembling different fragments for different users.
Cache query results. This means the application server would need to interpret the query results and convert to HTML; you would do this for caching data you'd use to assemble the page. For SO, for instance, you might cache "Leo Jiang's avatar path is x, and they are following tags {a, b, c}".
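As a sketch of that last level (the key name and TTL are my own choices, not a recommendation), the read path is simply cache-aside with an expiry:

```python
# Sketch: cache-aside for a query result, with a TTL as a blunt invalidation strategy.
import json
import redis

r = redis.Redis(decode_responses=True)

def get_user_sidebar(user_id, load_from_db):
    key = f"sidebar:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    data = load_from_db(user_id)           # hit MySQL only on a cache miss
    r.setex(key, 300, json.dumps(data))    # expire after 5 minutes
    return data
```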
The problem with caching, of course, is invalidation and the trade-off between performance and up-to-date information. You can also get lots of weird bugs with caches being out of sync across load balancers.

Graphing login events, but need extra data. Rewrite history or post-process?

We have been tracking user login events for a while now in a MongoDB collection. Each event contains the userID, datetime, and a couple other fundamental attributes about the event.
For a new feature, we want to present a graph of these login events, with different groups representing cohorts related to the user who did the event. Specifically, we want to group by the "Graduation Year" attribute of the user.
In our event log, we do not record the Graduation Year of the user who's logging in, so we cannot easily query that directly. We see two ways to go forward, plus a third "in-between" option:
Instead of making a single MongoDB query to get the logins, we make that query PLUS a second one to our Relational DB to get the secondary user data we require, and merge the two together.
We could optionally query for all the users, load them into memory, and loop through the Events, or we could go through the events and find only the User IDs that logged in and query for those specific User IDs. (Then loop again, merging them in.)
The post-processing could be done on the server-side or we could send all the data to the client. (Currently our plan is to just send the raw event data to the client for processing into the graph.)
Upsides: The event log is made to track events. User "Graduation Year" is not relevant to the event in question; it's relevant to the user who did the event. This seems to separate concerns more properly. As well, if we later decide we want to group on a different piece of metadata (let's say: male vs female), it's easy to just join that data in as well.
Downsides: Part of the beauty of our event log is that it can quickly spit out tons of aggregate data that's ready to use. If there are 10,000 users, we may have 100,000 logins. It seems crazy to need to loop through 100,000 logins whenever this data is requested fresh (as in, not cached).
We can write a script that does a one-time load of all the events (presumably in batches), then requests the user metadata and merges it in, re-writing the Event Log to include the relevant data.
Upsides: The event log is our single point of interaction when loading the data. Client requests all the logins; gets 100,000 rows; sorts them and groups them according to Graduation Year; [Caches it;] and graphs it. Will have a script ready to re-add more data if it came to that, down the road.
Downsides: We're essentially rewriting history. We're polluting our event log with secondary data that isn't explicitly about the event we claim to be tracking. Need to rewrite or modify the script to add more data that we didn't know we wanted to track, if we had to, down the road.
We replicate the Users table in MongoDB, perhaps only as-needed (say when an event's metadata is unavailable), and do a join (I guess that's a "$lookup" in Mongo) to this table.
Upsides: MongoDB does the heavy lifting of merging the data.
Downsides: We need to replicate, and somehow keep up to date, a secondary collection of our users' relevant metadata. I don't think MongoDB's $lookup works like a join in MySQL, and it may not really be any more performant at all, although I'd look into this before we implemented it.
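To make option 3 concrete, a rough sketch with pymongo (database, collection and field names are all assumptions):

```python
# Sketch: join login events to a replicated user-metadata collection with $lookup,
# then group logins by graduation year and day.
from pymongo import MongoClient

db = MongoClient()["analytics"]

pipeline = [
    {"$lookup": {
        "from": "users_meta",          # the replicated user metadata collection
        "localField": "userId",
        "foreignField": "userId",
        "as": "user",
    }},
    {"$unwind": "$user"},
    {"$group": {
        "_id": {"year": "$user.graduationYear",
                "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$datetime"}}},
        "logins": {"$sum": 1},
    }},
]

for row in db.login_events.aggregate(pipeline):
    print(row)
```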
For the sake of estimation, let's just say that any given visitor to our site will never have to load more than 100,000 logins and 10,000 users.
For what it's worth, Option #2 seems most preferable to me, even though it involves rewriting history, for performance reasons. Although I am aware that, at some point, if we were sending a user's browser multiple years of login data (that is, all 100,000 imaginary logins), maybe that's already too much data for their browser to process and render quickly, and perhaps we'd already be better off grouping it and aggregating it as some sort of regularly-scheduled process on the backend. (I don't know!)
As data warehouses go, 100K rows is quite small.
Performance in a DW depends on building and maintaining "Summary Tables". This makes a pre-determined set of possible queries very efficient, without having to scan the entire 'Fact' table. My discussion of Summary Tables (in MySQL): http://mysql.rjweb.org/doc.php/summarytables
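As a hedged sketch of what such a summary table could look like for this particular graph (all names are mine), maintained by the ingest job or a periodic roll-up:

```python
# Sketch: pre-aggregate logins per day and graduation year so the graph reads a
# few hundred rows instead of scanning every event.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="appdb")
cur = conn.cursor()

cur.execute(
    """
    CREATE TABLE IF NOT EXISTS logins_by_day_gradyear (
        login_date DATE NOT NULL,
        grad_year  SMALLINT NOT NULL,
        logins     INT UNSIGNED NOT NULL,
        PRIMARY KEY (login_date, grad_year)
    )
    """
)

# Periodic roll-up: upsert one row per (day, graduation year).
cur.execute(
    """
    INSERT INTO logins_by_day_gradyear (login_date, grad_year, logins)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE logins = logins + VALUES(logins)
    """,
    ("2023-02-14", 2019, 37),
)
conn.commit()
```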

Database solutions for a prospective (not retrospective) search

Let's say we have a requirement to create a system that consumes a high-volume, real-time data stream of documents, and that matches those documents against a set of user-defined search queries as those documents become available. This is a prospective, as opposed to a retrospective, search service. What would be an appropriate persistence solution?
Suppose that users want to see a live feed of documents that match their queries--think Google Alerts--and that the feed must display certain metadata for each document. Let's assume an indefinite lifespan for matches; i.e., the system will allow the user to see all of the matches for a query from the time when the particular query was created. So the metadata for each document that comes in the stream, and the associations between the document and the user queries that matched that document, must be persisted to a database.
Let's throw in another requirement, that users want to be able to facet on some of the metadata: e.g., the user wants to see only the matching documents for a particular query whose metadata field "result type" equals "blog," and wants a count of the number of blog matches.
Here are some hypothetical numbers:
200,000 new documents in the data stream every day.
- The metadata for every document is persisted.
1,000 users with about 5 search queries each: about 5,000 total user search queries.
- These queries are simple boolean queries.
- As each new document comes in, it is processed against all 5,000 queries to see which queries are a match.
Each feed--one for each user query--is refreshed to the user every minute. In other words, for every feed, a query to the database for the most recent page of matches is performed every minute.
Speed in displaying the feed to the user is of paramount importance. Scalability and high availability are essential as well.
The relationship between users and queries is relational, as is the relationship between queries and matching documents, but the document metadata itself is just key-value pairs. So my initial thought was to keep the relational data in a relational DB like MySQL and the metadata in a NoSQL DB, but can the faceting requirement be achieved in a NoSQL DB? Also, constructing a feed would then require making calls to two separate data stores, which is additional complexity. Or perhaps shove everything into MySQL, but this would entail lots of joins and counts. If we store all the data as key-value pairs in some other kind of data store, again, how would we do the faceting? And there would be a ton of redundant metadata for documents that match more than one search query.
What kind of database(s) would be a good fit for this scenario? I'm aware of tools such as Twitter Storm and Yahoo's S4, which could be used to construct the overall architecture of such a system, but I'd like to focus on the database, given the data storage, volume, and query/faceting requirements.
First, I disagree with Ben. 200k new records per day against 86,400 seconds in a day works out to a bit over two records per second. This is not earth-shattering, but it is a respectable clip for new data.
Second, I think this is a real problem that people face. I'm not going to be one that says that this forum is not appropriate for the topic.
I think the answer to the question has a lot to do with the complexity and type of user queries that are supported. If the queries consist of a bunch of binary predicates, for instance, then you can extract the particular rules from the document data and then readily apply the rules. If, on the other hand, the queries consist of complex scoring over the text of the documents, then you might need an inverted index paired with a scoring algorithm for each user query.
My approach to such a system would be to parse the queries into individual data elements that can be determined from each document (which I might call a "queries signature" since the results would contain all fields needed to satisfy the queries). This "queries signature" would be created each time a document was loaded, and it could then be used to satisfy the queries.
Adding a new query would require processing all the documents to assign new values. Given the volume of data, this might need to be more of a batch task.
Whether SQL is appropriate depends on the features that you need to extract from the data. This in turn depends on the nature of the user queries. It is possible that SQL is sufficient. On the other hand, you might need more sophisticated tools, especially if you are using text mining concepts for the queries.
Thinking about this, it sounds like an event-processing task, rather than a regular data processing operation, so it might be worth investigating Complex Event Processing systems - rather than building everything on a regular database, using a system which processes the queries on the incoming data as it streams into the system. There are commercial systems which can hit the speed & high-availability criteria, but I haven't researched the available OSS options (luckily, people on quora have done so).
Take a look at Elastic Search. It has a percolator feature that matches a document against registered queries.
http://www.elasticsearch.org/blog/2011/02/08/percolator.html
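The percolator API has changed a lot since that 2011 post; with a recent Elasticsearch and its Python client, the flow looks roughly like this (index name, field names and the example query are made up):

```python
# Sketch: store user queries in a percolator field, then match each incoming
# document against all registered queries in one request.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="alerts",
    mappings={"properties": {
        "query": {"type": "percolator"},   # the stored user query
        "body": {"type": "text"},          # fields the incoming documents carry
        "result_type": {"type": "keyword"},
    }},
)

# Register one user query: "blog documents mentioning solr".
es.index(index="alerts", id="user1-q1", document={
    "query": {"bool": {"must": [
        {"match": {"body": "solr"}},
        {"term": {"result_type": "blog"}},
    ]}},
}, refresh=True)

# For each new document from the stream, find which saved queries it matches.
hits = es.search(index="alerts", query={
    "percolate": {"field": "query",
                  "document": {"body": "new solr release", "result_type": "blog"}},
})
print([h["_id"] for h in hits["hits"]["hits"]])
```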

Would using Redis with Rails provide any performance benefit for this specific kind of query

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users, which are organized in a nested set (preorder tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data: just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What a user enters isn't always the same as what he sees in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
The problem is that those stats also need to be summed for the subtree of users under a selected user (so it basically returns the sum of all values for all users under that user, including the user themselves).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot run the heavy SQL directly... that would kill the server. So I'm computing the results only once via a background process and the frontend just reads them.
Those SQL queries are hard to optimize and I'm glad I've moved away from that approach... (Caching is not an option. See below.)
Current app goes like this:
frontend: when user enters new data, it is saved to simple mysql table, like [user_id, pointer_id, date, value] and there is also insert to the queue.
backend: then there is a calc_daemon process, which every 5 seconds checks the queue for new "recompute requests". We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute month and year stats...). It does the recomputation the easy way: we select the data with customized, per-pointer SQL generated by their classes.
those computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in this table is like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced the number of records 5x).
when frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested set conds).
The problem is that those simple SQL queries with sums, group by and join-on-the-subtree can take like 200ms each... just for a few records... and we need to run a lot of them... I think they are optimized as well as they can be, according to EXPLAIN... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do with MySQL, and then perform the sums in Ruby manually... so performance could be hurt considerably... I'm not really sure I could express all the queries I have now in Redis... Loading the users in Rails and then doing something like "redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. I cannot have one entry in Redis with a list of all child IDs for some user (something like children_for_user_10: [1,2,3]) because the tree structure changes frequently... That's also the reason why I can't keep those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in real time.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried to do it SOA-way, but it failed because I ended up, one way or another, with XXX megabytes of data in Ruby... especially when generating the reports... and the GC just kills it...) (And a side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster since it is an in-memory database, but can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
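A minimal sketch of that counter idea with redis-py (the key scheme is mine, not from your schema):

```python
# Sketch: keep running sums in Redis so reads become O(1) lookups instead of SQL sums.
# One hash per user per week, one field per pointer; key names are hypothetical.
import redis

r = redis.Redis(decode_responses=True)

def record_stat(user_id, pointer_id, week, value):
    # Use hincrbyfloat instead if the entered values are not integers.
    r.hincrby(f"stats:{user_id}:{week}", f"pointer:{pointer_id}", value)

def week_sums(user_id, week):
    return r.hgetall(f"stats:{user_id}:{week}")  # {"pointer:<id>": "<sum>", ...}

record_stat(10, 3, "2015-W07", 5)
print(week_sums(10, "2015-W07"))
```

The subtree requirement is the hard part, though: you would still have to sum these per-user keys across the nested set yourself.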
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing GROUP BY in the recalculation? Is it possible to replace the group-bys with code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week, and keep variables for the current and previous values of user and week, as well as variables for the sums.
This is assuming the bottleneck is the recalculation; you don't really mention which part is too slow.
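A tiny standalone illustration of that "order the result set, then detect group changes" idea (not tied to your schema):

```python
# Sketch: control-break processing. Rows arrive ordered by (user, week), so group
# boundaries can be detected with a couple of variables instead of a GROUP BY.
def sum_by_user_and_week(rows):
    # rows: iterable of (user_id, week, value), already ordered by user_id, week
    current_key, total = None, 0
    for user_id, week, value in rows:
        key = (user_id, week)
        if key != current_key:
            if current_key is not None:
                yield current_key, total   # emit the finished group
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield current_key, total

rows = [(1, "W1", 5), (1, "W1", 2), (1, "W2", 4), (2, "W1", 9)]
print(list(sum_by_user_and_week(rows)))    # [((1,'W1'), 7), ((1,'W2'), 4), ((2,'W1'), 9)]
```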

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
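For example, if the IDs are hex SHA-1 strings, the conversion to and from the 20 raw bytes is one call each way (the column would then be something like BINARY(20)):

```python
# Sketch: store the 20 raw SHA-1 bytes and convert to/from the readable hex form
# only at the application boundary. The example digest is SHA-1("The quick brown
# fox jumps over the lazy dog").
raw = bytes.fromhex("2fd4e1c67a2d28fced849ee1bb76e7391b93eb12")
assert len(raw) == 20
print(raw.hex())  # back to the hex representation for display
```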
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
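A rough sketch of that first option (the DDL details beyond the suggested columns are my guesses):

```python
# Sketch: one rolling table for all saved searches, keys inserted in sorted order.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="appdb")
cur = conn.cursor()

cur.execute(
    """
    CREATE TABLE IF NOT EXISTS saved_searches (
        search_id BIGINT UNSIGNED NOT NULL,
        item_id   BINARY(20)      NOT NULL,  -- raw SHA-1 bytes (char(20) works too)
        PRIMARY KEY (search_id, item_id)
    )
    """
)

def save_search(search_id, hex_ids):
    rows = sorted((search_id, bytes.fromhex(h)) for h in hex_ids)  # sorted insert order
    cur.executemany(
        "INSERT INTO saved_searches (search_id, item_id) VALUES (%s, %s)", rows
    )
    conn.commit()

def expire_search(search_id):
    cur.execute("DELETE FROM saved_searches WHERE search_id = %s", (search_id,))
    conn.commit()
```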
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.