I want to store user profiles in Redis, as I have to frequently read multiple users' profiles. There are two options I see at present:
Option 1: store a separate hash key per user profile
[hash] - u1 profile {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - u2 profile {id: u2, name: user2, email: user2@domain.com, photo: url}
where each user's id is the hash key, and the profile is stored either as a JSON-serialized object in a single field or as plain field-value pairs.
Option 2: use a single hash key to store all user profiles
[hash] - users-profile u1 {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - users-profile u2 {id: u2, name: user2, email: user2@domain.com, photo: url}
where users-profile is the hash key, user ids are the fields, and the values are JSON-serialized profile objects.
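In redis-py terms (just to make the two layouts concrete; the key names are illustrative), the options would look roughly like this:

import json
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, db=0)

profile = {"id": "u1", "name": "user1", "email": "user1@domain.com", "photo": "url"}

# Option 1: one hash per user, stored as plain field-value pairs
r.hset("profile:u1", mapping=profile)
name = r.hget("profile:u1", "name")

# Option 2: one big hash, user id as field, JSON blob as value
r.hset("users-profile", "u1", json.dumps(profile))
u1 = json.loads(r.hget("users-profile", "u1"))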
Please tell me which option is best considering the following:
performance
memory utilization
read multiple user profiles - for batch processing I should be able to read users 1-100, 101-200, etc. at a time
larger dataset - what if there are millions of user profiles
As Sergio Tulentsev pointed out, it's not a good idea to store all the users' data (especially if the dataset is huge) inside one single hash.
Storing the users' data as individual keys is also not preferred if you're looking for memory optimization, as pointed out in this blog post.
Reading the users' data with a pagination mechanism calls for a database rather than a simple caching system like Redis. Hence it's recommended to use a NoSQL database such as MongoDB for this.
But reading from the database each time is a costly operation especially if you're reading a lot of records.
Hence the best solution would be to cache the most active users' data in Redis to eliminate the database fetch overhead.
I recommend looking into walrus.
It basically follows this pattern:
@cache.cached(timeout=expiry_in_secs)
def function_name(param1, param2, ..., param_n):
    # perform the database fetch
    # return the user data
This ensures that frequently accessed or requested user data stays in Redis, and the function automatically returns the value from Redis rather than making the database call. The cached key also expires after the configured timeout, so stale entries don't linger.
You set it up as follows:
from walrus import *
db = Database(host='localhost', port=6379, db=0)
cache = db.cache()  # the cache object whose cached() decorator is used above
where host can take the domain name of the Redis server or cluster if it is running remotely.
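Putting it together, a minimal sketch (the get_user_profile function and its database fetch are hypothetical stand-ins for your own code):

from walrus import Database

db = Database(host='localhost', port=6379, db=0)
cache = db.cache()

@cache.cached(timeout=600)  # keep the result in Redis for 10 minutes
def get_user_profile(user_id):
    # hypothetical database fetch; replace with your real query
    return {"id": user_id, "name": "...", "email": "...", "photo": "..."}

profile = get_user_profile("u1")  # first call hits the database
profile = get_user_profile("u1")  # repeat calls within the timeout are served from Redis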
Hope this helps.
Option #1.
Performance: Typically it depends on your use case, but let's say you want to read a specific user (for login/logout, authorization purposes, etc.). With option #1, you simply compute the user's hash key and get the user profile. With option #2, you would need to get all user profiles and parse the JSON (you can make this more efficient, but it would never be as efficient and simple as option #1).
Memory utilization: You can make option #1 and option #2 take the same size in Redis (with option #1, you can avoid storing the user id as part of the JSON). However, taking the same example of loading a specific user, with option #1 you only need to hold a single user profile JSON in code/memory instead of a bigger JSON with a whole set of user profiles.
Read multiple user profiles (for batch processing I should be able to read users 1-100, 101-200 at a time): For this, as is typically done with a relational database, you want to do paging. There are different ways of doing paging with Redis, but using a SCAN operation is an easy way to iterate over a set of users (see the sketch after this list).
Larger dataset - what if there are millions of user profiles:
"Redis is an in-memory but persistent on disk database, so it represents a different trade off where very high write and read speed is achieved with the limitation of data sets that can't be larger than memory"
If you "can't have a dataset larger the memory", you can look to Partitioning as the Redis FAQ suggests. On the Redis FAQ you can also check other metrics such as the "maximum number of keys a single Redis instance can hold" or "Redis memory footprint"
PROS for option 1
(But don't use a hash; use a plain key per user, like SET profile:4d094f58c96767d7a0099d49 {...})
Iterating over keys is slightly faster than iterating over hash fields. (That's also why you should modify option 1 to use SET, not HSET.)
Retrieving a key's value is slightly faster than retrieving a hash field.
PROS for option 2
You can get all users in a single call with HGETALL (or a chosen subset with HMGET), but only if your user base is not very big. Otherwise it can be too hard for the server to serve you the result.
You can flush all users with a single command. Useful if you have a backing DB.
PROS for option 3
Option 3 is to break your user data into hash buckets determined by a hash of the user id. This works well if you have many users and do batches often. Like this:
HSET profiles:<bucket> <id> {json object}
HGET profiles:<bucket> <id>
HGETALL profiles:<bucket>
The last one retrieves a whole bucket of profiles. I don't recommend letting a bucket grow beyond 1 MB in total. This works well with sequential ids, and not so well with hashed ids, because buckets can grow too much. If you used it with hashed ids and a bucket grew so much that it slows down your Redis, you can fall back to HSCAN (like in option 2) or redistribute objects into more buckets with a new hash function.
Faster batch load.
The trade-off: slightly slower single object store/load.
My recommendation, if I got your situation right, is to use the 3rd option with sequential ids and a bucket range of 100. And if you're aiming at high volumes of data, plan for a cluster from day one.
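A rough Python sketch of that bucketing scheme with sequential numeric ids and buckets of 100 (the key names and JSON encoding are illustrative):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

BUCKET_SIZE = 100  # users 0-99 share bucket 0, users 100-199 share bucket 1, ...

def bucket_key(user_id):
    return "profiles:%d" % (int(user_id) // BUCKET_SIZE)

def save_profile(user_id, profile):
    r.hset(bucket_key(user_id), user_id, json.dumps(profile))

def load_profile(user_id):
    raw = r.hget(bucket_key(user_id), user_id)
    return json.loads(raw) if raw else None

def load_bucket(bucket_no):
    # one round trip returns a whole batch, e.g. users 100-199 for bucket 1
    raw = r.hgetall("profiles:%d" % bucket_no)
    return {uid.decode(): json.loads(val) for uid, val in raw.items()}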
Related
I am working on a platform where unique user IDs are identity IDs from an Amazon Cognito identity pool, which look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
My understanding is that this type of userID (practically an expanded UUID that needs to be stored as a CHAR or VARCHAR) gives poor indexing performance, so using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem... Would moving the favorites list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item IDs to be used in a SELECT...WHERE IN query, be an acceptable alternative?
Thanks so much!
OK, so here I want to explain why this is not good, the alternative, and the read/write workflow of your application.
Why not: this is not a good architecture, because if something happens to your Cognito user pool, you can't repopulate it with the same ids for each individual user. Moreover, Cognito is being offered in more regions now compared to last year. Let's say your user base is in Indonesia and Cognito becomes available in Singapore; you want to move your user pools from Tokyo to Singapore because of the latency issue. Now you not only have the problem of moving the users, you also have the issue of repopulating your database. So this approach lacks scalability and maintainability and breaks the single responsibility principle (updating Cognito requires you to update the db and vice versa).
Alternative solution: leave the db index to the db domain, and use the username as the link between your db and your Cognito user pool. So:
The read workflow will be:
User authentication: the user authenticates and gets the token.
Your app verifies the token, and from its payload gets the username.
Your app contacts the db and gets the user's information based on the username.
Your app brings the user to their page and provides the information that was stored in the database.
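A rough Python sketch of that read path (this assumes a Cognito ID token, the PyJWT library, and a users table keyed by username; the region, pool id, client id, and SQL are placeholders):

import jwt                      # PyJWT
from jwt import PyJWKClient

REGION = "us-east-1"            # placeholder values - adjust to your pool
USER_POOL_ID = "us-east-1_XXXXXXXXX"
APP_CLIENT_ID = "your-app-client-id"
JWKS_URL = ("https://cognito-idp.%s.amazonaws.com/%s/.well-known/jwks.json"
            % (REGION, USER_POOL_ID))

def username_from_token(id_token):
    # verify the token against the pool's public keys, then read the username claim
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(id_token)
    claims = jwt.decode(id_token, signing_key.key,
                        algorithms=["RS256"], audience=APP_CLIENT_ID)
    return claims["cognito:username"]

def load_user(cursor, id_token):
    # look the user up by the verified username, not by the identity id
    cursor.execute("SELECT * FROM users WHERE username = %s",
                   (username_from_token(id_token),))
    return cursor.fetchone()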
The write workflow will be:
Your app gets the write request from the user along with the token.
It verifies the token.
It writes to the database based on the unique username.
Regarding MySQL, if you use the UserID and CognitoID composite as the primary key, it has a negative impact on query performance and is therefore not recommended for a large dataset.
However, using this (or even just the UserID) as a key in a NoSQL store such as DynamoDB is more suitable, unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control connected with Cognito Identity Pools.
While Cognito itself has some issues, which are discussed in this article (there are too many to list), it's a terrible idea to use Cognito and then create a completely separate user ID to use as a PK. First of all, that ID is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally, you now have extra complexity to deal with an imaginary problem. If you don't like what Cognito is giving you, then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need and will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
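For example, a sketch of the favorites table keyed directly on the Cognito identity id, with the DDL wrapped for a Python MySQL driver (the column size and names are assumptions; "us-east-1:<uuid>" fits comfortably in CHAR(64)):

# sketch: composite primary key on the Cognito identity id plus the item id
CREATE_FAVORITES = """
CREATE TABLE favorites (
    userID    CHAR(64)  NOT NULL,  -- Cognito identity id, e.g. us-east-1:<uuid>
    itemID    INT       NOT NULL,
    dateAdded DATETIME  NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (userID, itemID)
) ENGINE=InnoDB
"""

def create_favorites_table(cursor):
    cursor.execute(CREATE_FAVORITES)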
Our DB is mostly reads, but we want to add a "View count" and "thumbs up/thumbs down" to our videos.
When we stress tested incrementing views in MySQL, our database started deadlocking.
I was thinking about handling this problem by having a Redis DB that holds the view count and only writes it to the DB once the key expires. But I hear the notifications are not consistent, and I don't want to lose the view data.
Is there a better way of going about this? Or is the talk of Redis notifications being inconsistent not true?
Thanks,
Sammy
Redis' keyspace notifications are consistent, but delivery isn't guaranteed.
If you don't want to lose data, implement your own background process that manually expires the counters - i.e. copies them to MySQL and deletes them from Redis.
There are several approaches to implementing this lazy eviction pattern. For example, you can use a Redis Hash with two fields: a value field that you can HINCRBY and a timestamp field for expiry logic purposes. Your background process can then SCAN the keyspace to identify outdated keys.
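A rough redis-py sketch of that Hash variant (the key names, 5-minute threshold, and write_to_mysql step are placeholders; a real version would also guard against increments that land between the read and the delete, e.g. with a Lua script):

import time
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def incr_view(video_id):
    # bump the counter and remember when it was last touched
    key = "views:%s" % video_id
    r.hincrby(key, "count", 1)
    r.hset(key, "updated_at", int(time.time()))

def flush_stale_counters(max_age=300):
    # background job: copy counters idle for max_age seconds to MySQL, then delete
    now = int(time.time())
    for key in r.scan_iter(match="views:*"):
        data = r.hgetall(key)
        if now - int(data[b"updated_at"]) >= max_age:
            count = int(data[b"count"])
            # write_to_mysql(key, count)  # hypothetical persistence step
            r.delete(key)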
Another way is to use Sorted Sets to manage the counters. In some cases you can use just one Sorted Set, encoding both TTL and count into each member's score (using the float's integer and fractional parts, respectively), but in most cases it is simpler to use two Sorted Sets - one for TTLs and the other for values.
I've got a web application with the normal features (user settings, etc.); these are all stored in MySQL with the user.
A particular part of the application is a table of data for the user to edit.
I would like to make this table real-time across multiple users, i.e. multiple users can open the page, edit the data, and see changes made in real time by other users editing the table.
My thinking is to cache the data for the table in Redis, then perform all the actions in Redis, like keeping all the clients up to date.
Once all the connections have closed for a particular table, save the data back to MySQL for persistence. I know Redis can be used as a persistent NoSQL database, but as RAM is limited and all my other data is stored in MySQL, MySQL seems the better option.
Is this a correct use case for Redis? Is my thinking correct?
It depends on scalability: the number of records you are going to deal with and the structure you are going to use for saving them.
I will discuss the pros and cons of using Redis. The decision is up to you.
Advantages of using Redis:
1) It can handle heavy writes and reads better than MySQL.
2) It has flexible structures (hash map, sorted set, etc.) which can localize your writes instead of blocking the whole table.
3) Read queries will be much faster as they are served from the cache.
Disadvantages of using Redis:
1) Maintaining transactions. What happens if both users try to update a particular cell at the same time? Do you have the right data structure in Redis to handle this case?
2) What if the data is huge? It will exceed the memory limit.
3) What happens if there is an outage?
4) If you plan for Redis persistence, say using RDB or AOF, will you handle those 5-10 seconds of downtime?
Things to focus on:
1) How much data are you going to deal with? Assume a table of 10,000 rows with 10 columns takes 1 GB of memory in Redis (just an assumption; the actual memory will be much less). If your Redis is a 10 GB cluster, then you can handle only 10 such tables. Do the math on how many rows * columns * live tables you are going to work with and the memory they consume.
2) Redis uses a compact encoding for small aggregates within a configured range (http://redis.io/topics/memory-optimization). Let's say you decide to save the table with hash maps; you have two options: one hash map per column, or one hash map per row. The second option is the optimal one, because storing 1000 hash maps (rows) * 20 fields each (columns) takes about 10 times less memory than storing it the other way around. Also, this way, if a cell changes you can localize the change within a hash map of just 20 values (see the sketch below).
3) Loading the data back into your MySQL: how often is this going to happen? If that workload is high, then MySQL begins to perform worse for other operations.
4) How are you going to notify multiple clients of the changes? Will you load the whole table or only the part that changed? Loading only the changed part is the optimal approach. In that case, where will you maintain the list of cells that have been altered?
Evaluate your system with these questions and you will find whether it is feasible or not.
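A rough redis-py sketch of the per-row hash layout with a change list for notifying clients (the key names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def save_cell(table_id, row_id, column, value):
    # one small hash per row keeps a write localized to the row that changed
    r.hset("table:%s:row:%s" % (table_id, row_id), column, value)
    # remember which cell changed so clients can fetch only the delta
    r.sadd("table:%s:dirty" % table_id, "%s:%s" % (row_id, column))

def load_row(table_id, row_id):
    return r.hgetall("table:%s:row:%s" % (table_id, row_id))

def pop_dirty_cells(table_id):
    # atomically read and clear the set of changed cells
    key = "table:%s:dirty" % table_id
    with r.pipeline() as pipe:
        pipe.smembers(key)
        pipe.delete(key)
        changed, _ = pipe.execute()
    return changed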
Say I have a table of 100,000 cities (id, city), and my application is laid out in a way that requires the id->city mapping for many searches. Running a SQL query for each such translation is prohibitive. Say a search result page displays 1,000 records; I don't want to add 1,000 SQL queries on top of the existing query.
I've read about the eager loading primitive (:includes) and that doesn't exactly fit my need. I want to have this table of 100,000 cities resident in memory. At present I am trying to avoid something like Redis to save myself a dependency.
Is there a shared memory area where I can put this hash table when Rails/Passenger starts, so that all incoming requests can look up this persistent hash from there?
I believe if I create such a hash in my application_controller, it will be initialized every time a request comes in, which will make things worse than what I have now (or will it not?).
What is the Rails way of instantiating persistent memory which all requests can share?
It sounds like you need caching; you're just trying to avoid an external server. I've used Rails' built-in memory store cache for large calculations that are globally valid but that I don't want to make a round trip to an external cache for.
Use and adapt this setting in your environment.
config.cache_store = :memory_store, { size: 64.megabytes }
And of course, the Rails Caching Guide has all the details.
I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource to every collection that is relevant to that resource. Assume that typically a resource is relevant to about half of the collections, so that's 30,000 x (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage falls in a 12-hour window. That's 0.23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
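The arithmetic behind those rates, under the stated 12-hour-peak assumption:

SECONDS = 12 * 60 * 60        # 43,200 seconds of peak window
print(10_000 / SECONDS)       # ~0.23 transactions/sec at the low end
print(355_000 / SECONDS)      # ~8.2 transactions/sec at the high end
print(30 * SECONDS)           # 30/sec sustained is ~1.3 million rows per 12 hours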
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data. Write some benchmark transactions. Run under load to benchmark the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like any of the requirements of your problem.
MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported.