How to do ipfs pin add and get within 10 seconds? - ipfs

In my project, I need to download data from ipfs by giving a CID.
What I do is:
ipfs pin add {CID}
ipfs get {CID}
But I found that these two steps are quite time-consuming: together they take at least a minute.
I tried localhost and infura.
What can I do to let it download faster?

When you just want to download files, there is no need to pin them first. This might save a (tiny) bit of overhead.
However, the bulk of the time is probably spent in:
Looking up nodes in the distributed hash table that provide your data, and
Actually transferring the data from these nodes.
For small data sizes, the first item is probably the limiting factor. In general, the duration depends on the number of nodes your host connects to and how fast these nodes can transfer the data to your client.
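To check how long the download takes without the pin step, the fetch can be timed on its own. The following is only a sketch: it assumes the ipfs CLI is installed and a local daemon is running, and the function name fetch_cid, the 10-second default, and the injectable run parameter are made up for illustration.

```python
import subprocess
import time


def fetch_cid(cid, out_path, timeout_s=10.0, run=subprocess.run):
    """Run `ipfs get` for one CID, without pinning, and return elapsed seconds.

    Raises subprocess.TimeoutExpired if the fetch takes longer than timeout_s.
    `run` is injectable only so the function can be exercised without a daemon.
    """
    cmd = ["ipfs", "get", cid, "-o", out_path]
    start = time.monotonic()
    # check=True raises CalledProcessError if ipfs exits with a non-zero code.
    run(cmd, check=True, timeout=timeout_s)
    return time.monotonic() - start
```

If the call regularly hits the timeout even for small files, that points at the provider lookup rather than the transfer itself.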

Using Consul for dynamic configuration management

I am working on designing a little project where I need to use Consul to manage application configuration in a dynamic way so that all my app machines can get the configuration at the same time without any inconsistency issue. We are using Consul already for service discovery purpose so I was reading more about it and it looks like they have a Key/Value store which I can use to manage my configurations.
All our configurations are JSON files, so we make a zip file containing all of them and store a reference to where the zip file can be downloaded in a particular key in the Consul Key/Value store. All our app machines need to download this zip file from that reference (stored in a key in Consul) and save it to disk on each app machine. I then need all app machines to switch to the new config at approximately the same time to avoid any inconsistency issue.
Let's say I have 10 app machines, and all 10 need to download the zip file that has all my configs and then switch to the new configs at the same time, atomically, to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with, but I am confused about how loading the new files into memory and switching to the new configs will work:
All 10 machines are already up and running with the default config files, which are also on disk.
Some outside process will update the key in my Consul key/value store with the latest zip file reference.
All 10 machines have a watch on that key, so once someone updates the value of the key, the watch will be triggered and all 10 machines will download the zip file to disk and uncompress it to get all the config files.
(..)
(..)
(..)
Now this is where I am confused about how the remaining steps should work.
How should the apps load these config files into memory and then all switch at the same time?
Do I need to use leadership election with consul or anything else to achieve any of these things?
What will be the logic around this since all 10 apps are already running with default configs in memory (which is also stored on disk). Do we need two separate directories one with default and other for new configs and then work with these two directories?
Let's say this is the node I have in Consul, just a random design (could be wrong here):
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
where path holds the new zip file reference and machines is a key holding the list of all machines, so that each machine can add its IP address to that key as soon as it has downloaded the file successfully? And once the machines list has a size of 10, I can say we are ready to switch? If yes, how can I atomically update the machines key in that node? Maybe this logic is wrong, but I just wanted to throw something out. I would also need to clean up the machines list after the switch, since I need to repeat the exercise for the next config update.
Can someone outline the logic for how I can efficiently manage configuration on all my app machines dynamically while avoiding inconsistency issues? Maybe I need one more node as a status that holds details about each machine's config: when it was downloaded, when it was switched, and other details?
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all, just store the config directly in the consul kv store. And I'm not talking about a single key that maps to the entire json (I'm assuming your json is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the json (this way you won't need to pull the whole thing every time you make a query to consul).
If you get the config directly from consul, your consistency guarantees match consul consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config, that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistency you are willing to tolerate.
If you can stand a few seconds of inconsistency, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'll still have the watch on Consul, but you combine it with evicting your in-memory cache every few seconds, as a fallback in case the watch fails (or stalls) for some reason. This should give you a worst case of a few seconds of inconsistency (depending on the value you set for your TTL), but the normal case (I think) should be fast.
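A rough sketch of the TTL idea, assuming a loader callable that fetches the config from wherever it lives. The class name TTLConfigCache, the loader and clock parameters, and the 5-second default are all hypothetical, not part of Consul or any library.

```python
import time


class TTLConfigCache:
    """Keeps a config in memory and reloads it once it is older than ttl_s."""

    def __init__(self, loader, ttl_s=5.0, clock=time.monotonic):
        self._loader = loader    # hypothetical callable that fetches the config
        self._ttl_s = ttl_s
        self._clock = clock      # injectable so expiry can be tested
        self._value = None
        self._loaded_at = None

    def invalidate(self):
        """Call this from the watch handler to force a reload on next access."""
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl_s
        if expired:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

The watch calls invalidate() for the fast path; the TTL only bounds how stale a machine can get if the watch misses an update.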
If that's not acceptable (does downloading the zip take a lot of time, maybe?), you can go down the route you mentioned. To update a value atomically you can use their cas (check-and-set) operation. The write fails if another update happened between the time you read the value and the time Consul tried to apply your change. Then you need to pull the list of machines again, apply your change again, and retry (until it succeeds).
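The read, modify, retry loop could look roughly like this. FakeKV is an in-memory stand-in written for this sketch, not a Consul client; it only mimics the semantics where a write carries the index you last read and is refused if the key has changed since (in Consul's HTTP API that is the cas query parameter on a KV PUT, which returns false when the check fails). register_machine and max_attempts are hypothetical names.

```python
class FakeKV:
    """In-memory stand-in for a KV store with check-and-set semantics."""

    def __init__(self):
        self._data = {}   # key -> (modify_index, value)
        self._index = 0

    def get(self, key):
        """Return (modify_index, value), or (0, None) if the key is absent."""
        return self._data.get(key, (0, None))

    def cas(self, key, value, index):
        """Write only if the key's current index still equals `index`."""
        current_index, _ = self.get(key)
        if current_index != index:
            return False
        self._index += 1
        self._data[key] = (self._index, value)
        return True


def register_machine(kv, key, ip, max_attempts=10):
    """Add ip to the colon-separated list at key, retrying on CAS conflicts."""
    for _ in range(max_attempts):
        index, value = kv.get(key)
        machines = value.split(":") if value else []
        if ip in machines:
            return machines          # already registered, nothing to write
        machines.append(ip)
        if kv.cas(key, ":".join(machines), index):
            return machines          # our write won
        # Someone else wrote first: loop to re-read and try again.
    raise RuntimeError("gave up after repeated CAS conflicts")
```

Each machine would call register_machine once its download finished, and whoever sees the list reach 10 entries knows everyone is ready.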
I don't see why you would need 2 directories, but maybe I'm misunderstanding the question: when your app starts, before you do anything else, you check if there's a new config, and if there is, you download it and load it into memory. So you shouldn't have a "default config" if you want to be consistent. Once you've downloaded the config on startup, you're up and alive. When your watch signals a key change, you can download the config and directly overwrite your old config. This assumes you're running the watch-triggered code on a single thread, so you're not going to download the file multiple times in parallel. If the download fails, it's not like you're going to load the corrupt file into memory. And if you crash mid-download, you'll download again on startup, so it should be fine.
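One way to make "overwrite the old config" safe against a failed or partial download is to validate first and then swap the file in with a rename. This is a sketch under simplifying assumptions: the config is a single JSON blob rather than a zip of several files, and install_config is a made-up name.

```python
import json
import os
import tempfile


def install_config(raw_bytes, target_path):
    """Validate downloaded bytes, then atomically replace the file on disk.

    Returns the parsed config so the caller can swap its in-memory reference.
    Raises ValueError on corrupt input and leaves the old file untouched.
    """
    config = json.loads(raw_bytes)  # fails before anything on disk changes
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(raw_bytes)
        # Rename within one directory, so readers see the old or the new file.
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return config
```

With this, a crash mid-download leaves the previous file in place, and the startup check picks up the new version on the next run.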

Is temporary, non-persistent storage possible on IPFS?

I'm looking to store data temporarily on IPFS, probably using a JS library, may or may not host an IPFS node. The technology is so new it isn't easy to locate answers. Please, I appreciate your help, especially in both cases of hosted and non-hosted solutions. Thank you.
Data is only stored persistently on nodes that have "pinned" the content hash. Nodes that request data will cache it for an indeterminate amount of time (as the data is immutable and referenced by hash, the cache would always be valid).
Once you are finished with the data, you could remove it from your node, and as long as no one else requested it, it would gradually disappear from the network. You could not rely on that happening, though (for instance, if you have a requirement that the data be inaccessible after a certain amount of time).
You would need to be running your own node to initially host the file.

Frequently updating a large JSON file on Amazon S3 and potential write conflict

I first want to give a little overview on what I'm trying to tackle. My service is frequently fetching posts from various sources such as Instagram, Twitter, etc. and I want to store the posts in one large JSON file on S3. The file name would be something like: {slideshowId}_feed.json
My website will display the posts in a slideshow, and the slideshow will simply poll the S3 file every minute or so to get the latest data. It might even poll another file such as {slideshowId}_meta.json that has timestamp from when the large file changed in order to save bandwidth.
The reason I want to keep the posts in a single JSON file is mainly to save cost. I could have each source as its own file, e.g. {slideshowId}_twitter.json, {slideshowId}_instagram.json, etc. but then the slideshow would need to send GET request to every source every minute, thus increasing the cost. We're talking about thousands of slideshows running at once, so the cost needs to scale well.
Now back to the question. There may be more than one instance of the service running that checks Instagram and other sources for new posts, depending on how much I need to scale out. The problem with that is the risk of one service overwriting the S3 file while another one might already be writing to it.
Each service that needs to save posts to the JSON file would first have to GET the file, process it and check that the new posts are not duplicated in the JSON file, and then store the new or updated posts.
Could I have each service write the data to some queue like the Simple Queue Service (SQS) and then have some worker that takes care of writing the posts to the S3 file?
I thought about using AWS Kinesis, but it just processes the data from the sources and dumps it to S3. I need to process what has been written to the large JSON file as well to do some book keeping.
I had an idea of using DynamoDB to store the posts (basically to do the book keeping), and then I would simply have the service query all the data needed for a single slideshow from DynamoDB and store it to S3. That way the services would simply send the posts to DynamoDB.
There must be some clever way to solve this problem.
OK, for your use case:
there are many users for a single large S3 file
the file is updated often
the file path (ideally) should be consistent to make it easier to get and cache
the S3 file is generated by a process on an EC2 instance and updated once per minute
If the GET rate is less than 800 per second then AWS is happy with it. If not then you'll have to talk to them and maybe find another way. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
The file updates will be atomic so there are no issues with locking etc. See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Presumably, if a user requests the file "during" an update, they will see the old version. This behaviour is transparent to both parties.
File updates are "eventually" consistent. As you want to keep the URL the same, you will be updating the same object path in S3.
If you are serving across regions, then the time it takes to become consistent might be an issue. For the same region it seems to take a few seconds. AWS don't seem to be very open about this, so it's probably best to test it for your use case. As your file is small and the updates happen every 60 seconds, I would imagine it would be OK. You might have to state in your API description that updates actually take longer than 60 seconds to account for this.
As EC2 and S3 run on different parts of the AWS infrastructure (EC2 in a VPC and S3 behind a public HTTPS endpoint), you will pay for transfer costs from EC2 to S3.
I would imagine that you will be serving the S3 file via the S3 "pretend to be a website" feature. You will have to configure this too, but that is trivial.
This is what I would do:
The Kinesis stream would need to have enough capacity to handle writes from all your feed producers. For about $25/month you get to do 2000 writes per second.
Lambda would simply be fired whenever there are enough new items on your stream. You can configure the trigger to wait for 1000 new items and then run the Lambda to read all new items from the stream, process them and write them to Redis (ElastiCache). Your bill for that should be well under $10/month.
Smart key selection would take care of duplicate items. You can also set the items to expire if you need to. According to your description, your items should definitely fit into memory, and you can add instances if you need more capacity for reading and/or reliability. Running two Redis instances with enough memory to handle your data would cost around $26/month.
Your service would use Redis instead of S3, so you would only pay for the data transfer, and only if your service is not on AWS (<$10/month?).
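To illustrate what "smart key selection" could mean here: if the key is derived from the slideshow, the source and the post's own ID, writing the same post twice just overwrites the earlier copy. This sketch uses a plain dict as a stand-in for Redis; the key layout and the function names post_key and store_posts are assumptions for illustration.

```python
def post_key(slideshow_id, source, post_id):
    """Build a key that is identical for the same post, however often it is fetched."""
    return f"{slideshow_id}:{source}:{post_id}"


def store_posts(store, slideshow_id, posts):
    """Write posts into a dict-like store; re-fetched posts overwrite themselves."""
    for post in posts:
        key = post_key(slideshow_id, post["source"], post["id"])
        store[key] = post
    return store
```

Because the key is deterministic, several producers can write the same post concurrently without a read-check-write cycle, which is what made the single S3 file awkward.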

Storing/versioning large amount of JSON objects

Our cloud service deals with chunks of JSON data (items) that are manipulated all the time. An item can change as often as every second.
At the moment, an item is a JSON object that is modified in place. Now we need to implement versioning of these items as well.
Basically, every time a request to modify the object arrives, the object is modified and saved to the DB, and then we also need to store that version somewhere. So later on you will be able to say "give me version 345 of this item".
My question is: what would be the ideal way to store this history? Mind you, we do not need to query or alter the data once saved; all we need is to load it if necessary (0.01% of the time). The data is basically a meaningless blob.
We are researching multiple approaches:
Simple text files (file system)
Cloud storage (eg S3)
Version control (eg GIT)
Database (any)
Vault (For example Vault from hashicorp)
The main problem is that since items are updated every second, we end up with a lot of blobs. Consider: 100 items, updated every second, is 8,640,000 records in a single day. Not to mention 100 rps for the DB.
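For reference, the arithmetic behind that figure, as a small sketch (the function name versions_per_day is made up):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def versions_per_day(items, updates_per_second_per_item=1):
    """Number of stored versions per day if every update is kept."""
    return items * updates_per_second_per_item * SECONDS_PER_DAY


print(versions_per_day(100))  # 8640000
```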
Do you have any recommendation as to what would be the optimal approach? We need it to be scalable, fast and reliable; encryption out of the box would be a great plus.

How to configure RabbitMQ to store only part of messages in RAM (because I have really large queue)

I'm going to use RabbitMQ in a project where large amounts of data (~2*10^7 messages, 800 bytes each) need to be stored and processed. Of course, all this data won't fit in RAM, so I have a question: how to configure RabbitMQ to save only part of messages in RAM, and another part -- on disk?
Thank you.
Oops, I found the answer to my own question; let me share it:
According to http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/ :
When queues are small(ish) they will reside entirely within memory. Persistent messages will also get written to disc, but they will only get read again if the broker restarts. But when queues get larger, they will get paged to disc, persistent or not.