What's the quickest way to store an IPFS hash onto filecoin? - ipfs

I have an IPFS Hash / URL and I'd like to store it on Filecoin.
What is the quickest most effective way for me to do so?

Ally from Filecoin here :)
The easiest way to use ipfs and Filecoin to ensure censorship-resistance & resiliency is by
Using Fleek (while this does introduce a measure of centralisation, it still creates an immutable IPFS CID and does Filecoin storage deals for you, so all your data is still decentralised). You could then back up the resilience of it though by using multiple other services too eg. by adding the below:
Using an IPFS pinning service (such as pinata) and/or running your own IPFS node (again this does introduce a measure of centralisation unless you store on multiple)
Using web3.storage (or even nft.storage) - both of which are free services, which create an immutable CID (will be the same one for the same content) & makes a minimum of 8 x Filecoin deals perpetually for you (ie. permanent storage).
Using estuary.tech - which gives you the option of running your own estuary node also (both pins to ipfs and makes Filecoin storage deals) and is probably the best solution here
Using Chainsafe storage sdk or other storage helpers like Lighthouse Storage - which I believe allows you to drop in your own IPFS CID and creates these Filecoin deals for you.
If you DO want to make your own Filecoin storage deals, this is currently, well, hard (bring on FVM perpetual storage contracts & aggregation contracts - this will potentially allow for smart contracts that can make their own deals with the network!). It is possible to make your own deals now though with:
Snap deals - Ref
Other resource for making your own storage deals with Filecoin here
Some further resources on deploying decentralised websites:
Websites on IPFS
Breaking free of the client-server model video

Related

IPFS CID Reproducibility

This is perhaps very basic question but I couldn’t find in my search. Assuming the file content doesn’t change even single bit, would IPFS CID never change even if it’s being added/pinned by different users?
Would this assumption stay true for foreseeable future? I know it depends on hash algorithm, so just by knowing what hash algorithm was used (SHA-256), the IPFS CID would be reproducable for foreseeable feature right? Or is there other information that needs to be stored as well?
Would this assumption stay true for foreseeable future? I know it depends on hash algorithm, so just by knowing what hash algorithm was used (SHA-256), the IPFS CID would be reproducable for foreseeable feature right? Or is there other information that needs to be stored as well?
No, it is not safe to assume that ipfs add <file> will always give the same CID as there are many parameters other than the hash function itself that the binary is free to change over time. At a high level ipfs add turns a file/directory in a tree structure called UnixFS that represents that data, and since the default way in which ipfs add is allowed to change over time it means that CID output by ipfs add example.txt can change
Many of the UnixFS parameters are configurable (and described in ipfs add --help) and include options such as raw leaves and chunk size. This means that if you'd really like to ensure that ipfs add example.txt results in the same CID there are a set of flags you can pass to ensure this is the case.
Note, in general I'd try to avoid importing the same data to IPFS multiple times (it's a waste of resources anyway) although there may be some scenarios where that's just the easiest, or best, thing to do to get your project off the ground.
As long as there won't be any default settings changed (and the user doesn't set any specific ones to override them), all clients will generate the same CID from the same binary data.
But on the other hand, there's talk to replace SHA-256 with Blake2b in the future, and there are two versions of CIDs: v0 and v1.
Most custom settings require v1 of the CID.
The other things which influence the CID are
--raw-leaves
--chunker as well as
--trickle
And
--inline as well as --inline-limit when you're dealing with small files.
IPFS CIDs are not truly self describing in the sense that they don't capture information about how the data was chunked and what kind of merkle dag was created.
If you want the CIDs to be the same, I would suggest using content archive files (and uploading that). Content archive files would contain information about the various parameters #Ruben described in this answer, and different ipfs clients will be forced to use those parameters even if their default settings are different.
For example, I was using web3.storage (chunking size 1MB) and pinata (chunking size 256kb), uploading the same file to them gave different CIDs which was unacceptable for our purposes. Using CAR files allowed to have the same CIDs.

What would be an ideal way to share writable volume across containers for a web server?

The application in question is Wordpress, I need to create replicas for rolling deployment / scaling purposes.
It seem can't create more then 1 instance of the same container, if it uses a persistent volume (GCP term):
The Deployment "wordpress" is invalid: spec.template.spec.volumes[0].gcePersistentDisk.readOnly: Invalid value: false: must be true for replicated pods > 1; GCE PD can only be mounted on multiple machines if it is read-only
What are my options? There will be occasional writes and many reads. Ideally writable by all containers. I'm hesitant to use the network file systems as I'm not sure whether they'll provide sufficient performance for a web application (where page load is rather critical).
One idea I have is, create a master container (write and read permission) and slaves (read only permission), this could work - I'll just need to figure out the Kubernetes configuration required.
In https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes you can see a table with the possible storage classes that allow ReadWriteMany (the option you are looking for).
AzureFile (not suitable if you are using GCP)
CephFS
Glusterfs
Quobyte
NFS
PortworxVolume
The one that I've tried is that of NFS. I had no issues with it, but I guess you should also consider potential performance issues. However, if the writes are to be occassional, it shouldn't be much of an issue.
I think what you are trying to solve is having a central location for wordperss media files, in that case this would be a better solution: https://wordpress.org/plugins/gcs/
Making your kubernetes workload truly stateless and you can scale horizontally.
You can use Regional Persistent Disk. It can be mounted to many nodes (hence pods) in RW more. These nodes can be spread across two zones within one region. Regional PDs can be backed by standard or SSD disks. Just note that as of now (september 2018) they are still in beta and may be subject to backward incompatible changes.
Check the complete spec here:
https://cloud.google.com/compute/docs/disks/#repds

Is using a WebServer for executing methods remotely a good architectural decision?

A very small explanation of what i am talking about is presented below.
In my organization a strange kind of effort for load distributing is
implemented, i will attempt to explain it below.
A JBoss server runs in a machine with some IP xxx.xxx.xxx.xxx
There is an app which has many heavy time & resource consuming work
to do (heavy in terms of I/O operation ex- Large file uploads &
downloads - in Gigabytes)
Methods written in a Rest WebApp accessible by the URLs & parameters
to methods passed as URL params, set to the type POST, return value
of the methods as JSON output.
The App in question has a framework written just to make post calls
to the aforementioned WebApp wrapped nicely so that the calls can be
made without the programmer knowing what happens in the background.
This framework has externalized parameters which takes in the IP of
the machines running the WebApps and configure a list of such
machines available & route the method calls to the one least busy
whenever a method call to the framework is made.
Everything looks good but i suspect wrapping things in a http websever & doing processing there & sending the outputs in json may slow down things & collecting the logs in case of failures maybe tough.
Questions
I want to know the views of other programmers on this & whether this is a nice approach or not.
Also do any existing commercial applications follow something similar while trying to distribute the load?
In my opinion, the efficiency of load balancing lies in the tricky process of choosing the right node that will process your request.
A usual approach can be to monitor the CPU usage of the nodes that have to take the load and send the load to the one least busy. This process should be both accurate and efficient.
In your case, wrapping up of request and transfer of Json data should be a thing of least concern as they seem to be required activities and light ones too. Focus should be on the load balancing activity.
Answer to comment
If I understood correctly, http methodserver are there to process the requests from the central app. Load balancing will not be done by them. There has to be a central load balancing mechanism/tool that has to decide which methodserver will take which request.
All the methodservers (slaves) should be identical. That reduces maintainability effort as a fix can be done on one node and can be propagated to other nodes. This is how it is done in my organization.
It may be a requirement if a similar operation is done repeatedly, application server may reduce some duplicity here.

Message queuing solution for millions of topics

I'm thinking about system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case but I'm not sure how to handle the fact that I'll have millions of the objects - using separate topic for every of the objects does not sound good [or is it just fine?].
Can you please suggest approach I should should take and maybe even some open source message queuing system that would be reasonable?
Few more details:
there will be thousands of subscribers [meaning not plenty of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million of the objects,
events themselves dont have to carry any message. just information that that object was changed is enough,
vast majority of objects will never be subscribed to,
events occur at the maximum rate of few hundreds per second,
ideally the server should run under linux, be able to integrate with the rest of the ecosystem via http long-poll [using node js? continuations under jetty?].
Thanks in advance for your feedback and sorry for somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and from my experience, I think it is very reliable and offers a wide range of configuraions. Basically, RabbitMQ is an open-source ( Mozilla Public License (MPL) ) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user-interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have been working with, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ but I ended up being most fan of Burrow.NET as it is easier and more obvious to work with ( doesn't do a lot of magic under the hood ) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the amount of amount of objects that you are going to work with - I have worked with thousands ( not millions ). However, no matter how many objects I have been playing around with, RabbitMQ has always worked really stable and has never been the source to errors in the system.
So to sum up - RabbitMQ is simple to use and setup, supports AMQP, can be managed via HTTP and what I like the most - it's rock solid.
Break up the topics to carry specific events for e.g. "Object updated topic" "Object deleted"...So clients need to only have to subscribe to the "finite no:" of event based topics they are interested in.
Inject headers into your messages when you publish them and put intelligence into the clients to use these headers as message selectors. For eg, client knows the list of objects he is interested in - and say you identify the object by an "id" - the id can be the header, and the client will use the "id header" to determine if he is interested in the message.
Depending on whether you want, you may also want to consider ensuring guaranteed delivery to make sure that the client will receive the message even if it goes off-line and comes back later.
The options that I would recommend top of the head are ActiveMQ, RabbitMQ and Redis PUB SUB ( Havent really worked on redis pub-sub, please use your due diligance)
Finally here are some performance benchmarks for RabbitMQ and Redis
Just saw that you only have few 100 messages getting pushed out / sec, this is not a big deal for activemq, I have been using Amq on a system that processes 240 messages per second , and it just works fine. I use a thread pool of workers to asynchronously process the messages though . Look at a framework like akka if you are in the java land, if not stick with nodejs and the cool Eco system around it.
If it has to be open source i'd go for ActiveMQ, and an application server to provide the JMS functionality for topics and it has Ajax Support so you can access them from your client
So, you would use the JMS infrastructure to publish the topics for the objects, and you can create topis as you need them
Besides, by using an java application server you may be able to take advantages from clustering, load balancing and other high availability features (obviously based on the selected product)
Hope that helps!!!
Since your messages are very small might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. Key consideration is the low overhead - basically a 2 byte header for a small message. You probably can't use any simple or open source MQTT server, due to your volume. You probably need a heavy duty dedicated appliance like a MessageSight to handle your volume.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
Though not sure about your work environment but here are my bits. Can you identify each object with unique ID in your system. If so, you can have a topic per each event type. for e.g. you want to track object deletion event, object updation event and so on. So you can have topic for each event type. These topics would be published with Ids of object whenever corresponding event happened to the object. This will limit the no of topics you needed.
Second part of your problem is different subscribers want to subscribe to different objects. So not all subscribers are interested in knowing events of all objects. This problem statement scoped to message selector(filtering) mechanism provided by messaging framework. So basically you need to seek on what basis a subscriber interested in particular object. Have that basis as a message filtering mechanism. It could be anything: object type, object state etc. So ultimately your system would consists of one topic for each event type with someone publishing event messages : {object-type:object-id} information. Subscribers could subscribe to any topic and with an filtering criteria.
If above solution satisfy, you can use any messaging solution: activeMQ, WMQ, RabbitMQ.

How to get your code ready for Loadbalancing

As we did this in the past, i'd like to gather useful information for everyone moving to loadbalancing, as there are issues which your code must be aware of.
We moved from one apache server to squid as reverse proxy/loadbalancer with three apache servers behind.
We are using PHP/MySQL, so issues may differ.
Things we had to solve:
Sessions
We moved from "default" php sessions (files) to distributed memcached-sessions. Simple solution, has to be done. This way, you also don't need "sticky sessions" on your loadbalancer.
Caching
To our non-distributed apc-cache per webserver, we added anoter memcached-layer for distributed object caching, and replaced all old/outdated filecaching systems with it.
Uploads
Uploads go to a shared (nfs) folder.
Things we optimized for speed:
Static Files
Our main NFS runs a lighttpd, serving (also user-uploaded) images. Squid is aware of that and never queries our apache-nodes for images, which gave a nice performance boost. Squid is also configured to cache those files in ram.
What did you do to get your code/project ready for loadbalancing, any other concerns for people thinking about this move, and which platform/language are you using?
When doing this:
For http nodes, I push hard for a single system image (ocfs2 is good for this) and use either pound or crossroads as a load balancer, depending on the scenario. Nodes should have a small local disk for swap and to avoid most (but not all) headaches of CDSLs.
Then I bring Xen into the mix. If you place a small, temporal amount of information on Xenbus (i.e. how much virtual memory Linux has actually promised to processes per VM aka Committed_AS) you can quickly detect a brain dead load balancer and adjust it. Oracle caught on to this too .. and is now working to improve the balloon driver in Linux.
After that I look at the cost of splitting the database usage for any given app across sqlite3 and whatever db the app wants natively, while realizing that I need to split the db so posix_fadvise() can do its job and not pollute kernel buffers needlessly. Since most DBMS services want to do their own buffering, you must also let them do their own clustering. This really dictates the type of DB cluster that I use and what I do to the balloon driver.
Memcache servers then boot from a skinny initrd, again while the privileged domain watches their memory and CPU use so it knows when to boot more.
The choice of heartbeat / takeover really depends on the given network and the expected usage of the cluster. Its hard to generalize that one.
The end result is typically 5 or 6 physical nodes with quite a bit of memory booting a virtual machine monitor + guests while attached to mirrored storage.
Storage is also hard to describe in general terms.. sometimes I use cluster LVM, sometimes not. The not will change when LVM2 finally moves away from its current string based API.
Finally, all of this coordination results in something like Augeas updating configurations on the fly, based on events communicated via Xenbus. That includes ocfs2 itself, or any other service where configurations just can't reside on a single system image.
This is really an application specific question .. can you give an example? I love memcache, but not everyone can benefit from using it, for instance. Are we reviewing your configuration or talking about best practices in general?
Edit:
Sorry for being so Linux centric ... its typically what I use when designing a cluster.