Say I have a table of 100,000 cities (id, city) and my application is laid out in a way that I require the id->city mapping for many searches. Running a SQL query for each such translation is prohibitive: if a search result page displays 1000 records, I don't want to add 1000 SQL queries on top of the existing query.
I've read about the eager loading primitive (:includes) and that doesn't exactly fit my need. I want to have this table of 100,000 cities resident in memory. At present I am trying to avoid something like Redis to save myself from one more dependency.
Is there a shared memory area where I can put this hash table when Rails/Passenger starts, so that all incoming requests can look up this persistent hash from there?
I believe that if I create such a hash in my application_controller, the hash will be initialized every time a request comes in, which will make things worse than what I have now (or will it not?).
What is the Rails way of instantiating a shared, persistent in-memory store that all requests can share?
It sounds like you need caching; you're just trying to avoid an external server. I've used Rails' built-in memory store for large calculations that are globally valid but that I don't want to make a round trip to an external cache for.
Use and adapt this setting in your environment config:
config.cache_store = :memory_store, { size: 64.megabytes }
And of course, the Rails Caching Guide has all the details.
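For the id-to-city case in the question, a minimal sketch of how that could look (assuming a City model with id and city columns; the cache key and expiry are arbitrary choices):
class City < ApplicationRecord
  # Loads the full id => city hash once per process and keeps it in the
  # in-process memory store; subsequent calls are plain hash lookups.
  def self.id_map
    Rails.cache.fetch("cities/id_map", expires_in: 12.hours) do
      pluck(:id, :city).to_h
    end
  end
end

# Translating 1000 ids in a search result then costs zero extra queries:
# names = record_ids.map { |id| City.id_map[id] }
Note that :memory_store is per-process, so with Passenger each worker keeps its own copy of the hash; for 100,000 city names that is typically on the order of tens of megabytes per process.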
I want to store users' profiles in Redis, as I have to frequently read multiple users' profiles. There are two options I see at present:
Option 1: store a separate hash key per user's profile
[hash] - u1 profile {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - u2 profile {id: u2, name: user2, email: user2@domain.com, photo: url}
where each user's id is the hash key, and the value is the JSON-serialized profile object (or, instead of JSON, the profile attributes as field-value pairs).
Option 2: use a single hash key to store all users' profiles
[hash] - users-profile u1 {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - users-profile u2 {id: u2, name: user2, email: user2@domain.com, photo: url}
where users-profile is the single hash key, the user ids are the fields, and the values are JSON-serialized profile objects.
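In redis-rb terms (just to make the layouts concrete; a sketch, with the same key names as above), the two options would look something like:
require "redis"
require "json"

redis = Redis.new(host: "localhost", port: 6379, db: 0)
profile = { id: "u1", name: "user1", email: "user1@domain.com", photo: "url" }

# Option 1: one hash per user, with the profile attributes as fields.
redis.hset("profile:u1", "name", "user1", "email", "user1@domain.com", "photo", "url")
name = redis.hget("profile:u1", "name")

# Option 2: a single hash for everyone; user id as field, JSON blob as value.
redis.hset("users-profile", "u1", JSON.generate(profile))
u1 = JSON.parse(redis.hget("users-profile", "u1"))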
Please tell me which option is best, considering the following:
performance
memory utilization
reading multiple users' profiles - for batch processing I should be able to read users 1-100, 101-200, etc. at a time
larger datasets - what if there are millions of user profiles
As Sergio Tulentsev pointed out, it's not good to store all the users' data (especially if the dataset is huge) inside one single hash by any means.
Storing the users' data as individual keys is also not preferred if you're looking for memory optimization, as pointed out in this blog post.
Reading the users' data with a pagination mechanism calls for a database rather than a simple caching system like Redis. Hence it's recommended to use a NoSQL database such as MongoDB for this.
But reading from the database each time is a costly operation, especially if you're reading a lot of records.
Hence the best solution would be to cache the most active users' data in Redis to eliminate the database fetch overhead.
I recommend looking into walrus.
It basically follows this pattern:
@cache.cached(timeout=expiry_in_secs)
def function_name(param1, param2, ..., param_n):
    # perform database fetch
    # return user data
This ensures that frequently accessed or requested user data is kept in Redis, and the function automatically returns the value from Redis rather than making the database call. The cached key also expires after the configured timeout.
You set it up as follows:
from walrus import *

db = Database(host='localhost', port=6379, db=0)
cache = db.cache()  # cache object whose .cached() decorator is used above
where host can be the domain name of a Redis instance running remotely.
Hope this helps.
Option #1.
Performance: Typically it depends on your use case, but let's say you want to read a specific user (on login/logout, for authorization purposes, etc). With option #1, you simply compute the user's hash key and get the user profile. With option #2, you would need to get all the users' profiles and parse the JSON (although you can make it efficient, it would never be as efficient and simple as option #1).
Memory utilization: You can make option #1 and option #2 take the same size in Redis (with option #1, you can avoid storing the hash/user id as part of the JSON). However, picking the same example of loading a specific user, you only need to hold in code/memory a single user profile JSON instead of a bigger JSON with a set of user profiles.
Read multiple users' profiles (for batch processing I should be able to read users 1-100, 101-200 at a time): For this, as is typically done with a relational database, you want to do paging. There are different ways of doing paging with Redis, but using a scan operation is an easy way to iterate over a set of users (see the sketch at the end of this answer).
Larger dataset - what if there are millions of user profiles:
Redis is an in-memory but persistent on disk database, so it represents a different trade off where very high write and read speed is achieved with the limitation of data sets that can't be larger than memory
If you "can't have a dataset larger than memory", you can look into partitioning, as the Redis FAQ suggests. In the Redis FAQ you can also check other metrics such as the "maximum number of keys a single Redis instance can hold" or "Redis memory footprint".
PROS for option 1
(But don't use a hash; use a plain key per user, like SET profile:4d094f58c96767d7a0099d49 {...})
Iterating keys is slightly faster than iterating hash fields. (That's also why you should modify option 1 to use SET, not HSET.)
Retrieving a key's value is slightly faster than retrieving a hash field.
PROS for option 2
You can get many users in a single call with HMGET (or all of them with HGETALL), but only if your user base is not very big. Otherwise it can be too hard for the server to serve you the result.
You can flush all users in a single command. Useful if you have a backing DB.
PROS for option 3
Option 3 is to break your user data into hash buckets determined by a hash of the user id. It works well if you have many users and do batches often. Like this:
HSET profiles:<bucket> <id> {json object}
HGET profiles:<bucket> <id>
HGETALL profiles:<bucket>
The last one gets a whole bucket of profiles. I don't recommend a bucket to be more than 1 MB in total. It works well with sequential ids, not so well with hashed ids, because buckets can grow too much. If you used it with hashed ids and a bucket grew so much that it slows your Redis, you can fall back to HSCAN (as in option 2) or redistribute objects into more buckets with a new hash function.
Faster batch loads
Slightly slower single-object store/load
My recommendation, if I got your situation right, is to use the 3rd option with sequential ids bucketed in ranges of 100. And if you are aiming at high volumes of data, plan for a cluster from day one.
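A minimal sketch of that bucketing scheme in redis-rb (bucket size of 100 sequential ids as recommended above; the key names are illustrative):
require "redis"
require "json"

BUCKET_SIZE = 100

def bucket_key(user_id)
  # ids 0-99 land in profiles:0, ids 100-199 in profiles:1, and so on
  "profiles:#{user_id / BUCKET_SIZE}"
end

def save_profile(redis, user_id, profile)
  redis.hset(bucket_key(user_id), user_id.to_s, JSON.generate(profile))
end

def load_profile(redis, user_id)
  json = redis.hget(bucket_key(user_id), user_id.to_s)
  json && JSON.parse(json)
end

# Batch-load a whole bucket (e.g. users 100-199) in one call.
def load_bucket(redis, bucket)
  redis.hgetall("profiles:#{bucket}").transform_values { |v| JSON.parse(v) }
end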
I've got a web application with the normal features: user settings etc., all stored in MySQL with the user.
A particular part of the application is a table of data for the user to edit.
I would like to make this table real-time across multiple users, i.e. multiple users can open the page, edit the data, and see in real time the changes made by other users editing the table.
My thinking is to cache the data for the table in Redis, then perform all the actions in Redis, like keeping all the clients up to date.
Once all the connections have closed for a particular table, save the data back to MySQL for persistence. I know Redis can be used as a persistent NoSQL database, but as RAM is limited and all my other data is stored in MySQL, MySQL seems a better option.
Is this a correct use case for Redis? Is my thinking correct?
It depends on the scale: the number of records you are going to deal with and the structure you are going to use for saving them.
I will discuss the pros and cons of using Redis. The decision is up to you.
Advantages of using Redis:
1) It can handle heavy writes and reads in comparison with MySQL.
2) It has flexible structures (hashmap, sorted set, etc.) which can localise your writes instead of blocking the whole table.
3) Read queries will be much faster as they are served from cache.
Disadvantages of using Redis:
1) Maintaining transactions. What happens if both users try to access a particular cell at the same time? Do you have the right data structure in Redis to handle this case?
2) What if the data is huge? It will exceed the memory limit.
3) What happens if there is an outage?
4) If you plan for Redis persistence, say using RDB or AOF, will you handle those 5-10 seconds of downtime?
Things to focus on:
1) How much data are you going to deal with? Assume a table of 10,000 rows with 10 columns takes 1 GB of memory in Redis (just an assumption; the actual memory will be much less). If your Redis is a 10 GB cluster then you can handle only 10 such tables. Do the math on how many rows * columns * live tables you are going to work with and the memory they consume.
2) Redis uses a special compact encoding for small aggregate data structures (http://redis.io/topics/memory-optimization). Let's say you decide to save the table as hashmaps; you have two options: a hashmap per column or a hashmap per row. The second option is the optimal one, because storing 1000 hashmaps (rows) of 20 fields each (columns) will take about 10 times less memory than storing it the other way. Also, this way, if a cell changes you can localize the write to a hashmap of only 20 values (see the sketch at the end of this answer).
3) Loading the data back into your MySQL: how often is this going to happen? If that workload is high, MySQL begins to perform worse for other operations.
4) How are you going to notify multiple clients of the changes? Will you load the whole table or only the part that changed? Loading only the changed part would be optimal. In that case, where will you maintain the list of cells that have been altered?
Evaluate your system with these questions and you will find whether it is feasible or not.
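To make point 2 concrete, a sketch of the hashmap-per-row layout in redis-rb (the table:<id>:row:<n> key scheme and the column name are illustrative assumptions):
require "redis"

redis = Redis.new

# One small hash per row keeps every edit localized to a ~10-20 field hash,
# which also lets Redis keep its compact small-hash encoding.
def set_cell(redis, table_id, row, column, value)
  redis.hset("table:#{table_id}:row:#{row}", column, value)
end

def get_row(redis, table_id, row)
  redis.hgetall("table:#{table_id}:row:#{row}")
end

# A user edits the "price" cell of row 7 in table 42:
set_cell(redis, 42, 7, "price", "19.99")
get_row(redis, 42, 7)   # => {"price"=>"19.99", ...}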
I have a MySQL table which currently holds about 10 million records. Records are inserted by another batch application on a continuous basis and keep growing.
On the front end, users can search the data in this table based on different criteria. I am using a query DSL and JPA repositories to create dynamic queries and get data from the table, but the performance of the query with pagination is very slow. I have tried indexing, InnoDB-related tweaks, connection management with HikariCP, and Ehcache, but it still takes about 100 seconds to get the data.
Also, the entities are simple POJOs with no relations to other entities.
What is the best way/technology/framework to implement this scenario?
On a table of this size, dynamic queries are a really, REALLY bad idea; you need to tightly control access to the table and avoid table scans at all costs.
Ultimately, this sounds like a data warehouse problem, where the data is ETL'ed into a report-like format rather than kept as raw transactional data. Even so, you'll still need to define the access patterns you need and design your DWH to support them.
If you decide that the raw data is still the best format, another approach would be to define support metadata tables that could be queried to more quickly reduce the number of rows returned.
You could also look at clustering the data if you can find some way to logically break it out into chunks. However, since you say the queries are dynamic, this might not be possible.
My suggestion would be to create a dedicated cache and have the web app query the cache instead of the DB. If the ETL batch into your main table runs on a defined schedule, you can keep the cache hot by triggering a load from the main table into the cache. This can be any in-memory cache like Ignite or Infinispan.
However, this is not a sustainable solution on its own; eventually you would need to restrict your users to seeing data within a manageable date range only, and either discard the old data or deliver it asynchronously via flat-file generated reports.
The entire history of the huge dataset cannot be made available to the user in the UI.
You could also try data virtualization tools to figure out what users are more comfortable with before deciding on the partition strategy in production.
We are creating a web application which will contain posts (something like FB or, say, YouTube). For the stable part of the data (i.e. the facets, search results & their content), we plan to use Solr.
What should we use for the unstable part of the data (i.e. dynamic and volatile content such as like counts, comment counts, view counts)?
Option 1) Redis
What about storing the "dynamic" data in a different data store (like Redis)? That way, every time the counts get refreshed, I do not have to reindex the data into Solr at all. Solr indexing would then only be triggered when new posts are added to the site, and never on any activity on the posts by the users.
Side note:
I also looked at the Solr-Redis plugin at https://github.com/sematext/solr-redis
The plugin looks good, but I'm not sure if it can be used to fetch the data stored in Redis as part of the Solr result set, i.e. in the returned docs. The description reads more like the Redis data can be used in function queries for boosting, sorting, etc. Does anyone have experience with this?
Option 2) SOLR NRT with Soft Commits
We would depend on the built-in NRT features. Let's say we do soft commits every second and hard commits every 10 seconds. Suppose a huge amount of dynamic data is created on the site across many posts, e.g. 100,000 likes across 10,000 posts. This would mean soft-committing changes to 10,000 documents every second, and then hard-committing that many every 10 seconds. Isn't this overkill?
Which option is preferred? How would you compare both options in terms of scalability, maintenance, feasibility, best practices, etc.? Any real-life experiences or links to articles?
Many thanks!
p.s. EFF (external file fields) is not an option, as I read that the data in that file can only be used in function queries and cannot be returned as part of a document.
I would advise you to go with Redis for data that is changing frequently. One thing to keep in mind about Solr soft commits is that they invalidate some cache data, and if you have a lot of it, opening a new searcher and rebuilding the caches may be more time-consuming than you'd like.
Solr is great for full text search and going through data that requires tokenization. It's also pretty quick; however I don't think it is the right tool for this job.
You can also check out this blog post for more info on Solr commits.
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
Per the post:
Soft commits are about visibility, hard commits are about durability. The thing to understand most about soft commits are that they will make documents visible, but at some cost. In particular the “top level” caches, which include what you configure in solrconfig.xml (filterCache, queryResultCache, etc) will be invalidated! Autowarming will be performed on your top level caches (e.g. filterCache, queryResultCache), and any newSearcher queries will be executed. Also, the FieldValueCache is invalidated, so facet queries will have to wait until the cache is refreshed. With very frequent soft commits it’s often the case that your top-level caches are little used and may, in some cases, be eliminated. However, “segment level caches”, which include function queries, sorting caches, etc are “per segment”, so will not be invalidated on soft commit.
Redis => haven't explored this option
SOLR NRT with Soft Commits => It's kind of overkill and inefficient, since it will update the complete document even though only part of the document changes each time. The more efficient way to handle this is by keeping these dynamic fields (like count, view count, etc.) outside the Lucene inverted index. There are two ways to handle this.
A. Using EFF (external file fields). In the post you mentioned that:
"EFF (external file fields) is not an option, as I read that the data in that file can only be used in function queries and cannot be returned as part of a document." If I'm not wrong, you want these dynamic fields' values in the search response. We can get them by using field(external_field_name) in the fl parameter.
B. Using docValues. DocValues fields are column-oriented fields with a document-to-value mapping built at index time. DocValues are stored outside the inverted index. We can define these fields with docValues and use the partial (atomic) update feature to update just these fields.
DocValue => https://solr.apache.org/guide/8_0/docvalues.html
EFF => https://solr.apache.org/guide/8_0/working-with-external-files-and-processes.html
Document Update => https://solr.apache.org/guide/6_6/updating-parts-of-documents.html
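For approach B, a minimal sketch of what such a partial (atomic) update could look like over HTTP, using Solr's documented set/inc update operations (the collection name posts and the counter field names are assumptions; those fields would need to be defined with docValues):
require "net/http"
require "json"
require "uri"

# Bump the counters on one post without reindexing its whole document.
uri  = URI("http://localhost:8983/solr/posts/update")
body = [{ id: "post-123", like_count: { set: 42 }, view_count: { inc: 1 } }]

response = Net::HTTP.post(uri, JSON.generate(body), "Content-Type" => "application/json")
puts response.body   # Solr answers with a JSON status; visibility is governed by the (soft) commit settings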
I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource to every collection that is relevant to that resource. Assume that typically a resource is relevant to about half of the collections, so that's 30,000 x (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage falls in a 12-hour window. That's 0.23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data, write some benchmark transactions, and run them under load to benchmark the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like any of the requirements of your problem.
MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported.