In https://eips.ethereum.org/EIPS/eip-4337, the authors say "Users send UserOperation objects into a separate mempool". I was wondering what this means, and my current thinking is that it refers to an off-chain memory pool for storing pending UserOperations (high-level transactions). So I think 'alt mempool' is probably short for 'alternative memory pool', implemented on top of a distributed in-memory key-value store (e.g. Redis) that stores pending UserOperations.
Did I understand this correctly? What do you think?
What I did: read and re-read EIP-4337 and thought about it.
What I expect: to check whether my understanding is correct or not.
I'm benchmarking my Kafka cluster, version 1.0.0-cp1.
In the part of my benchmark that focuses on the maximum possible throughput with an ordering guarantee and no data loss (a topic with only one partition), do I need to set the max.in.flight.requests.per.connection property to 1?
I've read this article, and I understand that I only have to set max.in.flight to 1 if I enable the retry feature on my producer with the retries property.
Another way to ask my question: is one partition + retries=0 (producer props) sufficient to guarantee ordering in Kafka?
I need to know because increasing max.in.flight drastically increases the throughput.
Your use case is slightly unclear. You mention ordering and no data loss, but you don't specify whether you can tolerate duplicate messages, so it's unclear whether you want At Least Once (QoS 1) or Exactly Once.
Either way, as you're using 1.0.0 and only a single partition, you should have a look at the Idempotent Producer instead of tweaking the producer configs. It lets you properly and efficiently guarantee ordering and no data loss.
From the documentation:
Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer.
The early Idempotent Producer forced max.in.flight.requests.per.connection to 1 (for the same reasons you mentioned), but in the latest releases it can be used with max.in.flight.requests.per.connection set to up to 5 and still keep its guarantees.
Using the Idempotent Producer you'll not only get stronger delivery semantics (Exactly Once instead of At Least Once), but it might even perform better!
I recommend you check the delivery semantics in the docs: http://kafka.apache.org/documentation/#semantics
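For illustration, here is a minimal sketch of what enabling the Idempotent Producer could look like. I'm showing it with the confluent-kafka Python client purely as an example; the broker address and topic name are placeholders, and the config keys match the ones the Java client uses.
# Sketch only: assumes a local broker and confluent-kafka-python installed.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "enable.idempotence": True,              # exactly-once, in-order delivery per partition
    "acks": "all",                           # implied by idempotence, shown for clarity
    # max.in.flight.requests.per.connection can stay at its default here:
    # up to 5 in-flight requests are allowed while keeping the ordering guarantee.
})

producer.produce("my-topic", value=b"payload")   # "my-topic" is a placeholder
producer.flush()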
Back to your question:
Yes, without the idempotent (or transactional) producer, if you want to avoid data loss (QoS 1) and preserve ordering, you have to set max.in.flight.requests.per.connection to 1, allow retries and use acks=all. As you saw, this comes at a significant performance cost.
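If you do stay on the plain (non-idempotent) producer, that combination would look roughly like this. Again a hedged sketch with the confluent-kafka Python client; all values are placeholders.
# Sketch only: ordering + no data loss without idempotence, at the cost of throughput.
from confluent_kafka import Producer

strict_producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                                # wait for all in-sync replicas
    "message.send.max.retries": 100000,           # librdkafka's name for the Java client's retries
    "max.in.flight.requests.per.connection": 1,   # so retries cannot reorder batches
})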
Yes, you must set the max.in.flight.requests.per.connection property to 1.
In the article you read there was an initial mistake (since corrected) where the author wrote:
max.in.flights.requests.per.session
which doesn't exist in the Kafka documentation.
This erratum probably comes from the book "Kafka: The Definitive Guide" (1st edition), where on page 52 you can read:
"...so if guaranteeing order is critical, we recommend setting in.flight.requests.per.session=1 to make sure that while a batch of messages is retrying, additional messages will not be sent..."
IMO, it is also valuable to know about the following behaviour, which makes things far more interesting and slightly more complicated.
When you enable enable.idempotence=true, every time you send a batch to the broker you also send a sequence number, starting from zero. The broker stores that sequence number on its side too. Suppose the broker's last stored sequence number is 3; when the next batch arrives, the broker compares the incoming number with its own and says:
if it's 4 - good, it's a new batch of records
if it's 3 (or lower) - it's a duplicate
if it's 5 (or higher) - it means messages were lost
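To make that concrete, here is a toy model of that broker-side check (my own sketch, not Kafka's actual code; the real broker tracks this per producer id and partition):
# Toy model of the broker comparing its last stored sequence number with an incoming one.
def check_sequence(last_stored: int, incoming: int) -> str:
    if incoming == last_stored + 1:
        return "OK: new batch, store it"
    if incoming <= last_stored:
        return "DUPLICATE: already stored, don't store it again"
    return "OUT_OF_SEQUENCE: an earlier batch is missing"

# Last stored is 3: 4 is new, 3 is a duplicate, 5 means something was lost.
for seq in (4, 3, 5):
    print(seq, "->", check_sequence(3, seq))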
And now max.in.flight.requests.per.connection: a producer can send up to that many concurrent requests without waiting for an answer from the broker. When we reach 3 (let's say max.in.flight.requests.per.connection=3), we start asking the broker for the previous results (and in the meantime we can't send more batches, even if they are ready).
Now, for the sake of the example, let's say the broker replies: "1 was OK, I stored it", "2 has failed", and now the important part: because 2 failed, the only possible answer for 3 is "out of sequence", which means it was not stored. The client now knows that it needs to resend 2 and 3, so it builds a list and resends them - in that exact order - if retries are enabled.
This explanation is probably oversimplified, but this is my basic understanding after reading the source code a bit.
I'm fairly new to NoSQL type databases, including Azure's DocumentDB. I've read through the documentation and understand the basics.
The documentation left me with some questions about data modeling, particularly in how it relates to pricing.
Microsoft charges fees on a "per collection" basis, with a collection being a list of JSON objects with no particular schema, if I understand it correctly.
Now, since there is no requirement for a uniform schema, is the expectation that your "collection" is analogous to a "database", in that the collection itself might contain different types of objects? Or is the expectation that each "collection" is analogous to a "table", in that it contains only objects of a similar type (allowing for variance in the object properties, perhaps)?
Does query performance dictate one way or another here?
Thanks for any insight!
The normal pattern under DocumentDB is to store lots of different types of objects in the same "collection". You distinguish them by either having a field type = "MyType" or with isMyType = true. The latter allows for subclassing and mixin behavior.
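For example (the document shapes and the query below are made-up illustrations, not anything from your question):
# Heterogeneous documents in one collection, distinguished by a discriminator property.
customer = {"id": "c-1", "type": "Customer", "name": "Alice"}
order    = {"id": "o-7", "type": "Order", "customerId": "c-1", "total": 19.99}

# Queries then just filter on the (indexed) discriminator:
query = "SELECT * FROM c WHERE c.type = 'Order' AND c.customerId = 'c-1'"

# The isMyType = true variant supports the mixin/subclass behavior mentioned above:
premium = {"id": "c-2", "isCustomer": True, "isPremiumCustomer": True, "name": "Bob"}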
As for performance, DocumentDB gives you guaranteed 10ms read/15ms write latency for your chosen throughput. For your production system, put everything in one big "partitioned collection" and slide the size and throughput levers over time as your space needs and load demands. You'll get essentially infinite scalability and DocumentDB will take care of allocating (and deallocating) resources (secondaries, partitions, etc.) as you increase (or decrease) your throughput and size levers.
A collection is analogous to a database, more so than a relational table. Normally, you would store a type property within documents to distinguish between types, and add the AND type='MyType' filter to each of your queries if restricting to a certain type.
Query performance will not be significantly different if you store different types of documents within the same collection vs. different collections, because you're just adding another filter against an indexed property (type). You might, however, benefit from pooling throughput into a single collection, vs. spreading small amounts of throughput across a collection per type.
I want to store users' profiles in Redis, as I have to frequently read multiple users' profiles. There are two options I see at present:
Option 1: store a separate hash key per user profile
[hash] - u1 profile {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - u2 profile {id: u2, name: user2, email: user2@domain.com, photo: url}
where each user's id is the hash key, profile is the field and the value is the JSON-serialized profile object (or, instead of JSON, the user's field-value pairs).
Option 2: use a single hash key to store all user profiles
[hash] - users-profile u1 {id: u1, name: user1, email: user1@domain.com, photo: url}
[hash] - users-profile u2 {id: u2, name: user2, email: user2@domain.com, photo: url}
where users-profile is the hash key, the user's id is the field and the value is the JSON-serialized profile object.
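In redis-py terms, the two layouts would look roughly like this (key names are taken from the examples above, everything else is illustrative):
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
profile = {"id": "u1", "name": "user1", "email": "user1@domain.com", "photo": "url"}

# Option 1: one hash per user, profile stored as field-value pairs
r.hset("u1", mapping={k: str(v) for k, v in profile.items()})  # mapping= needs a newer redis-py

# Option 2: one hash for everyone, user id -> JSON-serialized profile
r.hset("users-profile", "u1", json.dumps(profile))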
Please tell me which option is best, considering the following:
performance
memory utilization
reading multiple users' profiles - for batch processing I should be able to read profiles 1-100, 101-200, etc. at a time
larger datasets - what if there are millions of user profiles?
As Sergio Tulentsev pointed out, it's not good to store all the users' data (especially if the dataset is huge) inside one single hash by any means.
Storing the users' data as individual keys is also not preferred if you're looking for memory optimization, as pointed out in this blog post.
Reading the users' data with a pagination mechanism calls for a database rather than a simple caching system like Redis, so it's recommended to use a NoSQL database such as MongoDB for this.
But reading from the database each time is a costly operation, especially if you're reading a lot of records.
Hence the best solution would be to cache the most active users' data in Redis to eliminate the database fetch overhead.
I recommend looking into walrus.
It basically follows this pattern:
@cache.cached(timeout=expiry_in_secs)
def function_name(param1, param2, ..., param_n):
    # perform database fetch
    # return user data
This ensures that frequently accessed or requested user data stays in Redis, and the decorated function automatically returns the value from Redis rather than making the database call. The key also expires if it is not accessed for a long time.
You set it up as follows:
from walrus import *
db = Database(host='localhost', port=6379, db=0)
cache = db.cache()
where host can take the domain name of the Redis cluster running remotely.
Hope this helps.
Option #1.
Performance: Typically it depends on your use case, but let's say you want to read a specific user (for login/logout, authorization purposes, etc.). With option #1, you simply compute the user key and get the user profile. With option #2, you would need to get all user profiles and parse the JSON (although you can make it efficient, it would never be as efficient and simple as option #1).
Memory utilization: You can make option #1 and option #2 take the same size in Redis (with option #1, you can avoid storing the hash/user id as part of the JSON). However, taking the same example of loading a specific user, with option #1 you only need to hold a single user profile JSON in code/memory instead of a bigger JSON with a set of user profiles.
Read multiple users' profiles - for batch processing I should be able to read profiles 1-100, 101-200 at a time: for this, as is typically done with a relational database, you want to do paging. There are different ways of doing paging with Redis, but using a scan operation is an easy way to iterate over a set of users (see the sketch at the end of this answer).
Larger dataset - what if there are millions of user profiles? As the Redis FAQ puts it:
"Redis is an in-memory but persistent on disk database, so it represents a different trade off where very high write and read speed is achieved with the limitation of data sets that can't be larger than memory."
If you can't have a dataset larger than memory, you can look at partitioning, as the Redis FAQ suggests. In the Redis FAQ you can also check other figures such as the "maximum number of keys a single Redis instance can hold" or the "Redis memory footprint".
PROS for option 1
(But don't use a hash, use a plain key per user, like SET profile:4d094f58c96767d7a0099d49 {...})
Iterating over keys is slightly faster than iterating over hash fields. (That's also why you should modify option 1 to use SET, not HSET.)
Retrieving a key's value is slightly faster than retrieving a hash field.
PROS for option 2
You can get many users in a single call with HMGET (or all of them with HGETALL), but only if your user base is not very big. Otherwise it can be too hard for the server to serve you the result.
You can flush all users in a single command. Useful if you have a backing DB.
PROS for option 3
Option 3 is to break your user data into hash buckets determined by a hash of the user id. Works well if you have many users and do batches often. Like this:
HSET profiles:<bucket> <id> {json object}
HGET profiles:<bucket> <id>
HGETALL profiles:<bucket>
The last one gets a whole bucket of profiles. I don't recommend letting a bucket grow beyond about 1 MB in total. This works well with sequential ids, not so well with hashed ids, because buckets can grow too much. If you used it with hashed ids and a bucket grew so big that it slows down your Redis, you can fall back to HSCAN (as in option 2) or redistribute the objects into more buckets with a new hash function.
Faster batch load
Slightly slower single object store/load
My recommendation, if I understood your situation correctly, is to use the 3rd option with sequential ids and a bucket range of 100. And if you're aiming at high amounts of data, plan for a cluster from day one.
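If it helps, here is a rough sketch of that 3rd option with sequential ids and buckets of 100 (redis-py, key naming as in the commands above, everything else assumed):
import json
import redis

r = redis.Redis()
BUCKET_SIZE = 100

def bucket_key(user_id):
    # sequential ids: users 0-99 land in profiles:0, 100-199 in profiles:1, ...
    return "profiles:%d" % (int(user_id) // BUCKET_SIZE)

def save_profile(user_id, profile):
    r.hset(bucket_key(user_id), str(user_id), json.dumps(profile))

def load_profile(user_id):
    raw = r.hget(bucket_key(user_id), str(user_id))
    return json.loads(raw) if raw else None

def load_bucket(bucket):
    # up to BUCKET_SIZE profiles in a single HGETALL
    return {k: json.loads(v) for k, v in r.hgetall("profiles:%d" % bucket).items()}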
Say I have a table of 100,000 cities (id, city) and my application is laid out in a way that requires the id->city mapping for many searches. Running a SQL query for each such translation is prohibitive: if a search result page displays 1,000 records, I don't want to add 1,000 SQL queries on top of the existing query.
I've read about the eager loading primitive (:includes) and that doesn't exactly fit my need. I want to have this table of 100,000 cities resident in memory. At present I am trying to avoid something like Redis to save myself from one more dependency.
Is there a shared memory area where I can shove this hash table when Rails/Passenger starts, so that all incoming requests can look up this persistent hash from there?
I believe that if I create such a hash in my application_controller, it will be initialized every time a request comes in, which will make things worse than what I have now (or will it not?).
What is the Rails way of instantiating shared persistent memory which all requests can share?
It sounds like you need caching; you're just trying to avoid an external server. I've used Rails' built-in memory store cache for large calculations that are globally valid but that I don't want to make a round trip to an external cache for.
Use and adapt this setting in your environment config.
config.cache_store = :memory_store, { size: 64.megabytes }
And of course, the Rails Caching Guide has all the details.
I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource to every collection that is relevant to that resource. Assume that typically a resource is relevant to about half of the collections, so that's 30,000 x (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage falls in a 12-hour window. That's 0.23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
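For concreteness, here is a minimal sketch of what "perfectly" normalized could look like for the collections/resources model you describe. Table and column names are my guesses; sqlite3 is used only so the snippet is self-contained, and the DDL carries straight over to MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collections (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    name    TEXT
);
CREATE TABLE resources (
    id    INTEGER PRIMARY KEY,
    url   TEXT,
    title TEXT
);
-- many-to-many: a resource can be associated with several collections
CREATE TABLE collection_resources (
    collection_id INTEGER NOT NULL REFERENCES collections(id),
    resource_id   INTEGER NOT NULL REFERENCES resources(id),
    PRIMARY KEY (collection_id, resource_id)
);
""")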
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data, write some benchmark transactions, and run them under load to compare the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like the requirements of your problem. From the MySQL Cluster documentation:
MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported.