IPFS content availability / health information - ipfs

Is there a way to check how many nodes hold a specific piece of content?
If hashed content is stored on just one node and that node goes offline, the content will not be available anymore.
If, on the other hand, the content has previously been replicated to other nodes, the content is still available when the first node goes offline.

Yes, you can use the ipfs dht findprovs command to find peers providing a specific CID.
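A minimal sketch of what that looks like on the command line (the CID below is a placeholder; counting the returned peer IDs gives a rough idea of how many nodes currently advertise the content):

# list peers advertising providers for a CID, then count them
ipfs dht findprovs QmYourCidHere | wc -l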


IPFS CID Reproducibility

This is perhaps a very basic question, but I couldn't find it in my search. Assuming the file content doesn't change by even a single bit, would the IPFS CID ever change if it's added/pinned by different users?
Would this assumption stay true for the foreseeable future? I know it depends on the hash algorithm, so just by knowing what hash algorithm was used (SHA-256), the IPFS CID would be reproducible for the foreseeable future, right? Or is there other information that needs to be stored as well?
No, it is not safe to assume that ipfs add <file> will always give the same CID, as there are many parameters other than the hash function itself that the binary is free to change over time. At a high level, ipfs add turns a file/directory into a tree structure called UnixFS that represents that data, and since the defaults ipfs add uses are allowed to change over time, the CID output by ipfs add example.txt can change.
Many of the UnixFS parameters are configurable (and described in ipfs add --help) and include options such as raw leaves and chunk size. This means that if you'd really like to ensure that ipfs add example.txt results in the same CID, there is a set of flags you can pass to ensure this is the case.
Note: in general I'd try to avoid importing the same data to IPFS multiple times (it's a waste of resources anyway), although there may be some scenarios where that's just the easiest, or best, thing to do to get your project off the ground.
As long as the default settings don't change (and the user doesn't set any specific ones to override them), all clients will generate the same CID from the same binary data.
But on the other hand, there's talk of replacing SHA-256 with Blake2b in the future, and there are two versions of CIDs: v0 and v1.
Most custom settings require v1 of the CID.
The other things which influence the CID are:
--raw-leaves
--chunker
--trickle
--inline as well as --inline-limit, when you're dealing with small files.
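For example, a fully pinned-down invocation could look like the sketch below (assuming the go-ipfs/kubo CLI; the chosen values are just one consistent set, not the only valid one):

# fix every parameter that influences the resulting CID
ipfs add --cid-version=1 --hash=sha2-256 --raw-leaves --chunker=size-262144 example.txt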
IPFS CIDs are not truly self-describing in the sense that they don't capture information about how the data was chunked and what kind of Merkle DAG was created.
If you want the CIDs to be the same, I would suggest using content archive (CAR) files (and uploading those). Content archive files contain information about the various parameters Ruben described in his answer, and different IPFS clients will be forced to use those parameters even if their default settings are different.
For example, I was using web3.storage (chunking size 1MB) and Pinata (chunking size 256kB); uploading the same file to them gave different CIDs, which was unacceptable for our purposes. Using CAR files allowed us to get the same CIDs.
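A minimal sketch with the kubo CLI (the import flags and chunk size are illustrative; the point is that the CAR file fixes the DAG, so any service that imports it ends up with the same CID):

# import with explicit parameters, then export the resulting DAG as a CAR file
CID=$(ipfs add --quieter --cid-version=1 --raw-leaves --chunker=size-1048576 example.txt)
ipfs dag export "$CID" > example.car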

What would be an ideal way to share writable volume across containers for a web server?

The application in question is Wordpress, I need to create replicas for rolling deployment / scaling purposes.
It seems I can't create more than 1 instance of the same container if it uses a persistent volume (GCP term):
The Deployment "wordpress" is invalid: spec.template.spec.volumes[0].gcePersistentDisk.readOnly: Invalid value: false: must be true for replicated pods > 1; GCE PD can only be mounted on multiple machines if it is read-only
What are my options? There will be occasional writes and many reads. Ideally it would be writable by all containers. I'm hesitant to use network file systems, as I'm not sure whether they'll provide sufficient performance for a web application (where page load time is rather critical).
One idea I have is to create a master container (read and write permission) and slaves (read-only permission). This could work - I'll just need to figure out the Kubernetes configuration required.
In https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes you can see a table with the volume types that allow ReadWriteMany (the access mode you are looking for).
AzureFile (not suitable if you are using GCP)
CephFS
Glusterfs
Quobyte
NFS
PortworxVolume
The one I've tried is NFS. I had no issues with it, but I guess you should also consider potential performance issues. However, if the writes are only occasional, it shouldn't be much of an issue.
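For reference, a minimal sketch of an NFS-backed volume exposed with ReadWriteMany (the server address, export path and sizes are placeholders, and you still need an NFS server reachable from the cluster):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: wordpress-nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2          # placeholder NFS server
    path: /exports/wordpress  # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wordpress-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi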
I think what you are trying to solve is having a central location for WordPress media files; in that case this would be a better solution: https://wordpress.org/plugins/gcs/
It makes your Kubernetes workload truly stateless, and you can scale horizontally.
You can use a Regional Persistent Disk. It can be mounted to many nodes (hence pods) in RW mode. These nodes can be spread across two zones within one region. Regional PDs can be backed by standard or SSD disks. Just note that as of now (September 2018) they are still in beta and may be subject to backward-incompatible changes.
Check the complete spec here:
https://cloud.google.com/compute/docs/disks/#repds
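If you go this route, dynamic provisioning is done through a StorageClass along these lines (a sketch using the in-tree GCE PD provisioner; the name and disk type are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-pd-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd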

Is using a WebServer for executing methods remotely a good architectural decision?

A very short explanation of what I am talking about is presented below.
In my organization a strange kind of load-distribution effort has been implemented; I will attempt to explain it below.
A JBoss server runs on a machine with some IP xxx.xxx.xxx.xxx.
There is an app which has a lot of heavy, time- and resource-consuming work to do (heavy in terms of I/O operations, e.g. large file uploads & downloads - in gigabytes).
Methods are written in a REST web app, accessible by URLs, with the parameters to the methods passed as URL params; the requests are of type POST, and the return value of the methods comes back as JSON output.
The app in question has a framework written just to make POST calls to the aforementioned web app, wrapped nicely so that the calls can be made without the programmer knowing what happens in the background.
This framework has externalized parameters which take in the IPs of the machines running the web apps, configure a list of the machines available, and route each method call to the one least busy whenever a call to the framework is made.
Everything looks good, but I suspect that wrapping things in an HTTP web server, doing the processing there, and sending the outputs as JSON may slow things down, and that collecting the logs in case of failures may be tough.
Questions
I want to know the views of other programmers on this, and whether this is a good approach or not.
Also, do any existing commercial applications follow something similar when trying to distribute load?
In my opinion, the efficiency of load balancing lies in the tricky process of choosing the right node to process your request.
A usual approach is to monitor the CPU usage of the nodes that have to take the load and send the load to the one least busy. This process should be both accurate and efficient.
In your case, wrapping up the request and transferring JSON data should be the least of your concerns, as they seem to be required activities, and lightweight ones too. The focus should be on the load-balancing activity.
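To make the idea concrete, here is a minimal sketch of the "send to the least busy node" selection (illustrative only - the Backend shape, load metric and endpoint are assumptions, not your organization's framework):

// pick the least-busy backend from the configured list, based on a load metric
// reported by (or polled from) each node
interface Backend {
  ip: string;
  currentLoad: number; // e.g. CPU usage or number of in-flight requests
}

function pickLeastBusy(backends: Backend[]): Backend {
  return backends.reduce((best, b) => (b.currentLoad < best.currentLoad ? b : best));
}

// usage: route the next method call to the selected node
const backends: Backend[] = [
  { ip: "10.0.0.11", currentLoad: 0.72 },
  { ip: "10.0.0.12", currentLoad: 0.31 },
];
const target = pickLeastBusy(backends);
console.log(`POST http://${target.ip}/methods/uploadFile`);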
Answer to comment
If I understood correctly, the HTTP method servers are there to process requests from the central app. Load balancing will not be done by them. There has to be a central load-balancing mechanism/tool that decides which method server takes which request.
All the method servers (slaves) should be identical. That reduces maintenance effort, as a fix can be made on one node and propagated to the other nodes. This is how it is done in my organization.
It may be a requirement if a similar operation is done repeatedly; an application server may reduce some duplication here.

Is it possible to combine a dom-repeat on a platinum service worker?

I want to provide a user with the ability to cache up to 2,600+ items, by groupings (categories of books, individual books, or possibly even just chapters of a certain book if they don't want the whole book). It is not possible, as far as I can tell, to precache all of these items because there are 2,600+ of them, and there will be more in the future - the service worker will time out with even under a couple hundred. And since service workers either get all or none on install (if I understand correctly), do I need to use multiple service workers (with different IDs?), or am I thinking about this wrong?
What I am thinking is something like...
<iron-ajax></iron-ajax>
<template is="dom-repeat" items="...">
  <platinum-sw-register auto-register clients-claim skip-waiting>
    <platinum-sw-cache default-cache-strategy="fastest"
                       cache-config-file="../someGenerator.php"></platinum-sw-cache>
  </platinum-sw-register>
</template>
In other words:
Get a list of wanted URLs via iron-ajax (based upon what the user enables for cache)
Iterate through the URLs as groups via dom-repeat
Create a service worker with a customized cache-config for the URL group
Repeat 2 and 3 until done, then present a toast
That someGenerator.php would return a JSON config setup for the particular group of URLs.
My app is a single page app - with neon-animated-pages - one page representing categories, one for book listings, one for table of contents for each book, and then one of each the chapter contents. All of the data is obtained via iron-ajax.
Here are some links to demonstrate the issues:
The App
A large non-functional cache-config generated
I suspect that, in order to avoid service worker errors due to redundancy, or overwriting existing caches, I will need to assign individual IDs and include them in the generated cache-configs. Does that sound right?
No, I don't think that's the right approach. <dom-repeat> and creating multiple service workers aren't going to accomplish what you want.
It does look like you're bumping into some service worker-imposed timeouts during your install handler due to the delays in fetching the JSON configuration and performing all of the precaching. Taking a step back, are you sure that you need that entire set of URLs precached?
<platinum-sw> will give you runtime caching as well, so that when a browser loads a given URL when there's a network connection available, the resources will be automatically added to the cache and available offline during subsequent return visits.
There are other approaches that would use either window.caches to cache resources from within your controlled page, or using something like postMessage() to communicate a list of additional URLs to cache from your controlled page to your service worker. Both of those approaches would involve going beyond the default functionality you get from using <platinum-sw> and digging into the internals a bit.
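For instance, a minimal sketch of the window.caches approach, run from the controlled page (the cache name and URLs are placeholders; this sits outside of what <platinum-sw> manages for you):

// cache a user-selected group of URLs via the Cache Storage API
async function cacheSelectedUrls(urls: string[]): Promise<void> {
  const cache = await caches.open("user-selected-content");
  await cache.addAll(urls); // fetches each URL and stores the responses
}

// e.g. called with one of the URL groups returned by your iron-ajax request
cacheSelectedUrls(["/books/1/chapter-1.json", "/books/1/chapter-2.json"]);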

REST API - file (ie images) processing - best practices

We are developing a server with a REST API which accepts and responds with JSON. The problem is how to upload images from the client to the server.
Note: I am also talking about a use case where the entity (user) can have multiple files (carPhoto, licensePhoto) as well as other properties (name, email, ...), but when you create a new user, you don't send these images; they are added after the registration process.
These are the solutions I am aware of, but each of them has some flaws:
1. Use multipart/form-data instead of JSON
good: POST and PUT requests are as RESTful as possible; they can contain text inputs together with the file.
cons: It is not JSON anymore, and JSON is much easier to test, debug, etc. compared to multipart/form-data.
2. Allow to update separate files
The POST request for creating a new user does not allow adding images (which is OK in our use case, as I said at the beginning); uploading pictures is done by a PUT request as multipart/form-data to, for example, /users/4/carPhoto.
good: Everything (except the file uploading itself) remains in JSON; it is easy to test and debug (you can log complete JSON requests without being afraid of their length).
cons: It is not intuitive - you can't POST or PUT all variables of the entity at once - and this address /users/4/carPhoto can be considered more of a collection (the standard use case for a REST API looks like /users/4/shipments). Usually you can't (and don't want to) GET/PUT each variable of an entity, for example /users/4/name. You can get the name with GET and change it with PUT at /users/4. If there is something after the ID, it is usually another collection, like /users/4/reviews.
3. Use Base64
Send it as JSON but encode files with Base64.
good: Same as the first solution; it is as RESTful a service as possible.
cons: Once again, testing and debugging are a lot worse (the body can have megabytes of data), and there is an increase in size and also in processing time on both the client and the server.
I would really like to use solution no. 2, but it has its cons... Can anyone give me better insight into what the "best" solution is?
My goal is to have RESTful services with as much standards included as possible, while I want to keep it as simple as possible.
OP here (I am answering this question after two years; the post made by Daniel Cerecedo was not bad at the time, but web services are developing very fast)
After three years of full-time software development (with a focus also on software architecture, project management and microservice architecture), I would definitely choose the second way (but with one general endpoint) as the best one.
If you have a special endpoint for images, it gives you much more power over handling those images.
We have the same REST API (Node.js) for both mobile apps (iOS/Android) and the frontend (using React). This is 2017, so you don't want to store images locally; you want to upload them to some cloud storage (Google Cloud, S3, Cloudinary, ...), and therefore you want some general handling for them.
Our typical flow is that as soon as you select an image, it starts uploading in the background (usually a POST to an /images endpoint), returning the ID after the upload. This is really user-friendly, because the user chooses an image and then typically proceeds with some other fields (i.e. address, name, ...), so by the time he hits the "send" button, the image is usually already uploaded. He does not wait, watching a screen that says "uploading...".
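A minimal sketch of that flow from the client side (endpoint names follow the description above; the response shape and field names are assumptions):

// upload the image as soon as it is selected; the server answers with its ID
async function uploadImage(file: File): Promise<string> {
  const form = new FormData();
  form.append("image", file);
  const res = await fetch("/images", { method: "POST", body: form });
  const body = await res.json();
  return body.id;
}

// later, the entity itself stays plain JSON and just references the uploaded image
async function createUser(name: string, email: string, carPhotoId: string): Promise<void> {
  await fetch("/users", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name, email, carPhoto: carPhotoId }),
  });
}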
The same goes for getting images. Especially thanks to mobile phones and limited mobile data, you don't want to send original images; you want to send resized images so they don't take up that much bandwidth (and to make your mobile apps faster, you often don't want the client to resize them at all - you want the image that fits perfectly into your view). For this reason, good apps use something like Cloudinary (or we have our own image server for resizing).
Also, if the data are not private, then you send back to the app/frontend just the URL and it downloads the image from cloud storage directly, which is a huge saving of bandwidth and processing time for your server. In our bigger apps there are a lot of terabytes downloaded every month; you don't want to handle that directly on each of your REST API servers, which are focused on CRUD operations. You want to handle that in one place (our image server, which has caching etc.), or let cloud services handle all of it.
Small 2023 update: if possible, put a CDN in front of the pictures; it will usually save you a lot of money and make the pictures even more available (i.e. no issues when traffic peaks happen).
Cons: The only "con" you should think about is "unassigned images". The user selects images and continues filling in the other fields, but then says "nah" and turns off the app or tab, while meanwhile you have already successfully uploaded the image. This means you have uploaded an image which is not assigned anywhere.
There are several ways of handling this. The easiest one is "I don't care", which is a relevant one if this does not happen very often, or if you actually want to store every image the user sends you (for any reason) and you don't want any deletion.
Another one is easy too - you have a CRON job that runs, i.e., every week and deletes all unassigned images older than one week.
There are several decisions to make:
The first is about the resource path:
Model the image as a resource on its own:
Nested in the user (/user/:id/image): the relationship between the user and the image is made implicit
In the root path (/image):
The client is held responsible for establishing the relationship between the image and the user, or;
If a security context is being provided with the POST request used to create an image, the server can implicitly establish a relationship between the authenticated user and the image.
Embed the image as part of the user
The second decision is about how to represent the image resource:
As a Base64-encoded JSON payload
As a multipart payload
This would be my decision track:
I usually favor design over performance unless there is a strong case for it. It makes the system more maintainable and can be more easily understood by integrators.
So my first thought is to go for a Base64 representation of the image resource, because it lets you keep everything in JSON. If you choose this option you can model the resource path as you like.
If the relationship between user and image is 1 to 1, I'd favor modeling the image as an attribute, especially if both data sets are updated at the same time. In any other case you can freely choose to model the image either as an attribute, updating it via PUT or PATCH, or as a separate resource.
If you choose the multipart payload, I'd feel compelled to model the image as a resource on its own, so that other resources - in our case, the user resource - are not impacted by the decision to use a binary representation for the image.
Then comes the question: is there any performance impact in choosing Base64 vs multipart? We could think that exchanging data in multipart format should be more efficient, but this article shows how little the two representations differ in terms of size.
My choice: Base64
Consistent design decision
Negligible performance impact
As browsers understand data URIs (base64 encoded images), there is no need to transform these if the client is a browser
I won't cast a vote on whether to have it as an attribute or a standalone resource; it depends on your problem domain (which I don't know) and your personal preference.
Your second solution is probably the most correct. You should use the HTTP spec and mimetypes the way they were intended and upload the file via multipart/form-data. As far as handling the relationships, I'd use this process (keeping in mind I know zero about your assumptions or system design):
POST to /users to create the user entity.
POST the image to /images, making sure to return a Location header to where the image can be retrieved per the HTTP spec.
PATCH to /users/carPhoto and assign it the ID of the photo given in the Location header of step 2.
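A rough sketch of those three steps with curl (the host, IDs and field names are placeholders; adapt the PATCH target to however you address the user resource):

# 1. create the user entity
curl -X POST http://localhost:3000/users -H "Content-Type: application/json" -d '{"name":"Jane"}'
# 2. upload the image; the response carries a Location header pointing at it
curl -i -X POST http://localhost:3000/images -F "image=@car.jpg"
# 3. link the uploaded image to the user
curl -X PATCH http://localhost:3000/users/4 -H "Content-Type: application/json" -d '{"carPhoto":"/images/42"}'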
There's no easy solution. Each way has its pros and cons. But the canonical way is to use the first option: multipart/form-data. As the W3 recommendation guide says:
The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
We aren't really sending forms, but the implicit principle still applies. Using Base64 as a binary representation is incorrect because you're using the wrong tool to accomplish your goal; on the other hand, the second option forces your API clients to do more work in order to consume your API service. You should do the hard work on the server side in order to supply an easy-to-consume API. The first option is not easy to debug, but once you get it working, it probably never changes.
Using multipart/form-data you're sticking with the REST/HTTP philosophy. You can view an answer to a similar question here.
Another option is mixing the alternatives: you can use multipart/form-data, but instead of sending every value separately, you can send a value named payload with the JSON payload inside it. (I tried this approach using ASP.NET Web API 2 and it works fine.)