IPFS CID Reproducibility

This is perhaps a very basic question, but I couldn't find an answer in my search. Assuming the file content doesn't change by even a single bit, would the IPFS CID never change, even if the file is added/pinned by different users?
Would this assumption stay true for the foreseeable future? I know it depends on the hash algorithm, so just by knowing which hash algorithm was used (SHA-256), the IPFS CID would be reproducible for the foreseeable future, right? Or is there other information that needs to be stored as well?

No, it is not safe to assume that ipfs add <file> will always give the same CID, as there are many parameters other than the hash function itself that the binary is free to change over time. At a high level, ipfs add turns a file/directory into a tree structure called UnixFS that represents that data, and since the defaults used by ipfs add are allowed to change over time, the CID output by ipfs add example.txt can change as well.
Many of the UnixFS parameters are configurable (and described in ipfs add --help) and include options such as raw leaves and chunk size. This means that if you really need ipfs add example.txt to produce the same CID every time, there is a set of flags you can pass to pin those parameters down.
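For example, something like the following (flag names as listed in go-ipfs/Kubo's ipfs add --help; the values shown are just a sketch of "make everything explicit", not a guaranteed-stable recipe, since defaults differ between versions):
ipfs add --cid-version=1 --raw-leaves --chunker=size-262144 --hash=sha2-256 example.txt
As long as everyone runs ipfs add with the same explicit flags (and ideally the same client version), the resulting CID should match.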
Note, in general I'd try to avoid importing the same data to IPFS multiple times (it's a waste of resources anyway) although there may be some scenarios where that's just the easiest, or best, thing to do to get your project off the ground.

As long as no default settings are changed (and the user doesn't set any specific ones to override them), all clients will generate the same CID from the same binary data.
On the other hand, there is talk of replacing SHA-256 with Blake2b in the future, and there are two versions of CIDs: v0 and v1.
Most custom settings require v1 of the CID.
The other things which influence the CID are:
--raw-leaves
--chunker
--trickle
--inline and --inline-limit (when you're dealing with small files)

IPFS CIDs are not truly self-describing in the sense that they don't capture information about how the data was chunked or what kind of Merkle DAG was created.
If you want the CIDs to be the same, I would suggest using content archive (CAR) files (and uploading those). A CAR file carries the already-built blocks of the DAG, so the chunking and layout parameters described in the other answers are effectively baked in, and different IPFS clients will be forced to use them even if their default settings are different.
For example, I was using web3.storage (chunk size 1 MB) and Pinata (chunk size 256 KB); uploading the same file to them gave different CIDs, which was unacceptable for our purposes. Using CAR files allowed us to get the same CIDs.
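One hedged way to do this with a local go-ipfs/Kubo node is to build the DAG once (for example with explicit flags as shown earlier), export it as a CAR, and upload that CAR to each service; the file names here are illustrative:
ipfs add --cid-version=1 example.bin
ipfs dag export <CID printed by the add> > example.car
Because the CAR carries the already-built blocks, every service that imports it reports the same root CID.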

Related

Asp.NET Core 2 Images

I have a couple of questions about images, since I don't know what is better for my purposes. This might also be helpful for other people, because I couldn't find this info in other questions.
Although this is an ASP.NET Core 2.0 application, the first question is a general question about images.
QUESTION 1
When I have images that I want to reload every time, I usually add a query string so that browsers like Chrome or IE don't use the cached image they have. In my case I add the time ticks to the URL of the image; this way it loads the image every time, since the query string is always different:
filePath += "?" + DateTime.Now.Ticks;
But in my case I have a panel where the administrators of the page can change a lot of images. The problem is that when they change those images, if there is no query string, the users are going to see an old image they have stored in their browser cache.
The question is: isn't adding the query string to many images bad for performance? Is there any other solution for this?
QUESTION 2
I also have photos of the users and other images stored in the site. When I show an image, all the visitors of the site can see the path (for example: www.site.com/user_files/user_001/photo001.jpg).
Is there a way to hide those paths or transform them into something else in ASP.NET Core 2.0?
Thanks a lot.
Using something like ticks will get the job done, but in a very naive way. You're going to put more stress both on your server and the clients, since effectively the image will have to be refetched every single time, regardless of whether it has changed or not. If you will have any mobile users, the situation is far worse for them, as they'll be forced to redownload all these resources over and over, usually over limited (and costly) data plans.
A far better approach is to use a cryptographic digest, often called a "hash". Essentially, the same data run through the same hash function will always return the same hash. It's usually used to detect tampering with transmitted data, but since each message will (generally) have a unique hash and that hash will be the same each time for the same piece of data, you can also use this to generate a cache-busting query string that only changes when the image data itself changes.
Now, to be thorough, there's technically no guarantee that two messages won't result in the same hash. Instances where that occurs are called "collisions" and they can happen. However, if you use a sufficiently complex algorithm like SHA256, the likelihood of collisions is greatly reduced. Regardless, it should not be a real concern for this particular use case of cache-busting images.
Simplistically, to create the hash, you simply do something like:
// Requires: using System; using System.Security.Cryptography;
string hash;
using (var sha256 = SHA256.Create())
{
    // imageBytes is the raw file content, e.g. File.ReadAllBytes(imagePath)
    hash = Convert.ToBase64String(sha256.ComputeHash(imageBytes));
}
The value of hash then will be something like z1JZs/EwmDGW97RuXtRDjlt277kH+11EEBHtkbVsUhE=.
However, ASP.NET Core has an ImageTagHelper built-in that will handle this for you. Essentially, you just need to do:
<img src="/path/to/image.jpg" asp-append-version="true" />
As for your second question, about hiding or obfuscating the image path, that's not strictly possible, but can be worked around. The URL you use to reference the image uniquely identifies that resource. If you change it in any way, it's effectively not the same resource any more, and thus, would not locate the actual image you wanted to display. So, in a strict sense, no, you cannot change the URL. However, you can proxy the request through a different URL, effectively obfuscating the URL for the original image.
Simply, you'd just have an action on some controller that takes an image path (as part of the query string), loads that from the filesystem and returns it as a response. Care should be taken to limit the scope of files that can be returned like this, both based on directory (only allow your image directory, for example, not C:\Windows\, etc.) and file type (only allow images to be returned, not random text files, config files, etc.). That portion is straightforward enough, and you can find many examples online if you need them.
Ultimately, this doesn't really solve anything, though, because now your image path is simply in the query string instead. However, now that you've set this part up, you can encrypt that part of the query string using the Data Protection API. There's some basic getting started information available in the docs. Essentially, you're just going to encrypt the image path when creating the URL, and then in your action that returns the image, you decrypt the path first before running the rest of the code. For the encryption part, you can create a tag helper to do this for you without having to have a ton of logic in your views.
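To make that concrete, here is a minimal sketch, assuming a hypothetical ImagesController, an image root under wwwroot/user_files, and a purpose string of "ImagePaths"; the route, names and the jpg-only check are illustrative, not a fixed recipe:
// Requires: using System.IO; using System.Security.Cryptography;
//           using Microsoft.AspNetCore.DataProtection; using Microsoft.AspNetCore.Mvc;
public class ImagesController : Controller
{
    private readonly IDataProtector _protector;
    private readonly string _imageRoot =
        Path.Combine(Directory.GetCurrentDirectory(), "wwwroot", "user_files");

    public ImagesController(IDataProtectionProvider provider)
    {
        // The purpose string only has to match the one used when protecting the path.
        _protector = provider.CreateProtector("ImagePaths");
    }

    // e.g. GET /images/show?token=CfDJ8...
    public IActionResult Show(string token)
    {
        string relativePath;
        try
        {
            relativePath = _protector.Unprotect(token);
        }
        catch (CryptographicException)
        {
            return NotFound(); // tampered or invalid token
        }

        var fullPath = Path.GetFullPath(Path.Combine(_imageRoot, relativePath));

        // Only serve files inside the image root, and only images (jpg here for brevity).
        if (!fullPath.StartsWith(_imageRoot) || Path.GetExtension(fullPath) != ".jpg")
            return NotFound();

        return PhysicalFile(fullPath, "image/jpeg");
    }
}
When generating the URL you would call _protector.Protect(relativePath) (e.g. from a small tag helper, as described), so the real file system path never appears in the page.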

SpamAssassin: Site Wide Bayes not working?

Long ago I implemented site-wide bayes filtering as per http://wiki.apache.org/spamassassin/SiteWideBayesSetup.
I don’t think it ever worked, and I certainly find that my spam scores are always negative, with BAYES_00 suggesting that Bayes wasn’t used at all.
Here is what I have in my local.cf file:
bayes_path /etc/mail/spamassassin/bayes/bayes
bayes_file_mode 0777
When I run sa-learn I find instead that the tokens are stored in individual home directories.
What is the correct method to get this working?
Supplementary Question: if I can get this working, can I combine the various bayes_tok and other files?
If you get BAYES_00 results, then Bayes is indeed working as it has classified the email as being ham. A neutral result would be BAYES_50. You just need to train the Bayes database properly.
If sa-learn creates/updates Bayes files under your home directory, then it is either not reading the desired local.cf file, or the bayes_path gets overridden by a user-specific configuration file (e.g. /root/.spamassassin/user_prefs).
You could try one of the following:
run sa-learn under the same user account as spamassassin is executed
specify an explicit path to sa-learn, i.e.
sa-learn --dbpath /etc/mail/spamassassin/bayes/bayes
use the -D option to see what is really going on, i.e. which configuration files are being read, etc.
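For example (using the bayes_path from the local.cf above; run these as the same user SpamAssassin itself runs under, and treat the corpus paths as placeholders):
sa-learn --spam --dbpath /etc/mail/spamassassin/bayes/bayes /path/to/spam/corpus
sa-learn --ham --dbpath /etc/mail/spamassassin/bayes/bayes /path/to/ham/corpus
sa-learn --dump magic --dbpath /etc/mail/spamassassin/bayes/bayes
The last command prints the nspam/nham counters, so you can confirm that the site-wide database is actually the one being trained.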
If/when you get it working, you generally cannot combine the various database files. There are at least a bayes_toks and a bayes_seen file, because one contains the tokens learned and the other holds email Message-IDs and the associated training status (spam/ham). Then there can be an optional bayes_journal if you use deferred syncing.
Further details available in the manpage for sa-learn:
https://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html

Are cryptographic hashes injective under certain conditions?

Sorry for the lengthy post; I have a question about common cryptographic hashing algorithms, such as the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the actual digest produced usually has a fixed length (e.g., 160 bits under SHA-1, etc.) whereas the space of possible messages to be digested is virtually infinite.
However, if we generate a digest of a message which is at most as long as the digest generated, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms which are known to produce collisions even on messages whose bit length is shorter than the bit length of the digest produced?
I am actually looking for an algorithm, which has this property, i.e., which, at least in principle, may generate colliding hashes even for short input messages.
Background: We have a browser plug-in which, for every website visited, makes a server request asking whether the website belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to reconstruct some kind of surf history, we do not actually send the URL visited, but a hash digest (currently SHA-1) of some cleaned-up version. On the server side, we have a table of hashes of well-known URIs, which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.
For obvious reasons, this scheme is pretty fuzzy and admits false positives as well as misses of URIs that should have been matched.
So right now, we are considering changing the fingerprint generation to something which has more structure; for example, instead of hashing the full (cleaned-up) URI, we might:
split the host name into components at "." and hash those individually
split the path into components at "/" and hash those individually
join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler-32 as the hash) might yield "46989670.104268307.41353536/19857610/73204162".
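As an illustration of that scheme, here is a rough C# sketch with a hand-rolled Adler-32 (the method names are made up for this example; this is not the plug-in's actual code):
// Requires: using System; using System.Linq; using System.Text;
static uint Adler32(string s)
{
    const uint MOD = 65521;          // largest prime below 2^16
    uint a = 1, b = 0;
    foreach (byte c in Encoding.UTF8.GetBytes(s))
    {
        a = (a + c) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}

static string Fingerprint(string host, string path)
{
    // Host components joined with ".", path components joined with "/", as in the example above.
    var hostPart = string.Join(".", host.Split('.').Select(p => Adler32(p).ToString()));
    var pathPart = string.Join("/", path.Trim('/').Split('/').Select(p => Adler32(p).ToString()));
    return hostPart + "/" + pathPart;
}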
However, as such a fingerprint has a lot of structure (in particular, when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute the actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" component values, such as "www").
So right now, I am looking for a hash/digest algorithm which has a high rate of collisions (Adler-32 is seriously considered) even on short messages, so that the probability of a given component hash being unique is low. We hope that the additional structure we impose provides us with enough additional information to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).
I do not believe hashes are guaranteed to be injective for messages the same size as the digest. If they were, they would be bijective, which would be missing the point of a hash. This suggests that they are not injective for messages smaller than the digest either.
If you want to encourage collisions, I suggest you use any hash function you like, then throw away bits until it collides enough.
For example, throwing away 159 bits of a SHA-1 hash will give you a pretty high collision rate. You might not want to throw that many away.
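A minimal C# sketch of that truncation idea, keeping only 16 bits of a SHA-1 digest (the bit count is arbitrary and just for illustration):
// Requires: using System.Security.Cryptography; using System.Text;
static ushort TruncatedHash(string component)
{
    using (var sha1 = SHA1.Create())
    {
        byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(component));
        // Keep only the first two bytes; the fewer bits you keep, the more collisions you get.
        return (ushort)((digest[0] << 8) | digest[1]);
    }
}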
However, what you are trying to achieve seems inherently dubious. You want to be able to tell that the URL is one of yours, but not which one it is. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, because collisions will be random, and since there are many more URLs which are not yours than which are (I assume!), any given level of collision will lead to dramatically more confusion over whether a URL is one of yours or not than over which of yours it is.
Instead, how about sending the list of URLs to the plugin at startup, and then having it just send back a single bit indicating if it's visiting a URL in the list? If you don't want to send the URLs explicitly, send hashes (without trying to maximise collisions). If you want to save space, send a Bloom filter.
Since you're willing to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.
Each client downloads a Bloom filter containing the whole whitelist. Then the client has no need to otherwise communicate with the server, and there is no risk of spying.
At 2 bytes per URL, the false positive rate would be below 0.1%, and at 4 bytes per URL below 1 in 4 million.
Downloading the whole filter (and perhaps regular updates to it) is a large investment of bandwidth up front. But supposing that it has a million URLs on it (which seems quite a lot to me, given that you can probably apply some rules to canonicalize URLs before lookup), it's a 4MB download. Compare this with a list of a million 32 bit hashes: same size, but the false positive rate would be somewhere around 1 in 4 thousand, so the Bloom filter wins for compactness.
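To make the sizing above concrete, here is a toy C# Bloom filter (an illustration, not production code); with 16 bits per element and 11 hash positions it lands in the "below 0.1%" false positive range mentioned above:
// Requires: using System; using System.Collections; using System.Collections.Generic;
//           using System.Security.Cryptography; using System.Text;
class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    // 16 bits per element with k = 11 (roughly 0.7 * bits-per-element) gives about a 0.05% false positive rate.
    public BloomFilter(int capacity, int bitsPerElement = 16, int hashCount = 11)
    {
        _bits = new BitArray(capacity * bitsPerElement);
        _hashCount = hashCount;
    }

    public void Add(string url)
    {
        foreach (int index in Indexes(url)) _bits[index] = true;
    }

    public bool MightContain(string url)
    {
        foreach (int index in Indexes(url))
            if (!_bits[index]) return false;   // definitely not whitelisted
        return true;                           // probably whitelisted (false positives possible)
    }

    // Derive k bit positions from one SHA-256 digest via double hashing: h1 + i * h2.
    private IEnumerable<int> Indexes(string url)
    {
        byte[] digest;
        using (var sha256 = SHA256.Create())
            digest = sha256.ComputeHash(Encoding.UTF8.GetBytes(url));

        uint h1 = BitConverter.ToUInt32(digest, 0);
        uint h2 = BitConverter.ToUInt32(digest, 4);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((h1 + (uint)i * h2) % (uint)_bits.Length);
    }
}
For a million whitelisted URLs that is a 2 MB bit array, which the client downloads once and can then query locally.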
I don't know how the plugin contacts the server, but I doubt that you can get an HTTP transaction done in much under 1kB -- perhaps less with keep-alive connections. If filter updates are less frequent than one per 4k URL visits by a given user (or a smaller number if there are less than a million URLs, or greater than 1 in 4 million false positive probability), this has a chance of using less bandwidth than the current scheme, and of course leaks much less information about the user.
It doesn't work quite as well if you require new URLs to be whitelisted immediately, although I suppose the client could still hit the server at every page request, to check whether the filter has changed and if so download an update patch.
Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and RAM is limited), then I think you could still introduce some information-hiding by having the client compute which bits of the Bloom filter it needs to see, and asking for those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to ask for and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client asking for spurious bits it doesn't really need (hide the information in noise), I expect you could obfuscate what URLs you're visiting. It would take some analysis to prove how much that actually works, though: a spy would be aiming to find a pattern in your requests that's correlated with browsing a particular site.
I'm under the impression that you actually want public key cryptography, where you provide the visitor with a public key used to encode the URL, and you decrypt the URL using the secret key.
There are JavaScript implementations all over the place.

How to organize files in the filesystem for an upload-type site?

I'm wondering if there are any best practices for organizing files on the filesystem for a site that centers around users uploading files. (Not a hosting site like Imageshack, more like addons.mozilla.org)
Or am I over-analyzing this and should put everything in one folder?
I tend to think about user uploads as just another kind of user data, and so it all goes into a database. Obviously, make sure the database you are going to use for this is a good choice for that, for example, a SQL database isn't necessarily right.
If it makes sense, I try to use a url pattern that makes sense in the context of the usage pattern of the site, for example:
example.com/username/users_file.jpg
If there's just no obvious way to do that, and I have to use a surrogate key, I just live with it:
example.com/files/abc123
example.com/files/abc123/
example.com/files/abc123/users_file.jpg
All three are the same file. In particular, the abc123 is all that the app needs to look up the file; the extra bit at the end is there so that browsers get a good hint at what the file should be named when it's saved to disk.
Doing it this way means that no matter what the original file is named, it always is unique to the user. Even if the user wishes to upload 100 files with the same name, all are unique.
First (and probably obviously), put the users' files in some dedicated place so they don't risk overwriting other stuff.
Second, if you expect lots of files then you may want to have subfolders. The easiest way to do that is to use the first letter of their filename as the folder.
So if I were to upload "smile.jpg", you could store it there: s/smile.jpg
If you're super popular and still have too many files, you can use more letters. And if you expect to have tons of users and you have tons of servers, you can imagine splitting the work by saving on s.example.com/upload/s/smile.jpg (but really if you have tons of servers then you probably already have a transparent way of sharing storage and load).
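A minimal C# sketch of that first-letter layout (uploadRoot and the method name are just illustrative):
// Requires: using System.IO;
static string SubfolderPathFor(string uploadRoot, string fileName)
{
    // "smile.jpg" -> "<uploadRoot>/s/smile.jpg"
    string letter = char.ToLowerInvariant(fileName[0]).ToString();
    Directory.CreateDirectory(Path.Combine(uploadRoot, letter));
    return Path.Combine(uploadRoot, letter, fileName);
}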

Does anyone know how I can store large binary values in Riak?

For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50 MB, then proceed as you would with storing non-binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of "document database" (semi-structured, mostly de-normalized data) and "attachment storage" will have different needs than the key/value-style scheme — namely, the need for efficient online queries, conflict resolution, increased internal semantics, and robust expressions of relationships. (From "Schema Design in Riak - Introduction".)
Brian Mansell's answer is on the right track - you don't really want to store large binary values (over 50 MB) as a single object in Riak (the cluster becomes unusably slow after a while).
Instead, you have two options:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary value to start with, but really, run some performance tests to see what the average object size is for your cluster, after which it starts to crawl), break up the file into several chunks and store the chunks separately. (In fact, most people that I've seen go this route use chunks of 1 MB in size.)
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it; a rough sketch of the idea also follows after option 2.
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
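Following up on option 1, here is a minimal, client-agnostic C# sketch of the chunk-plus-manifest idea; the store/fetch delegates are hypothetical stand-ins for whatever put/get calls your Riak client exposes, and the 1 MB chunk size is just the common choice mentioned above:
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Text;
static class ChunkedFiles
{
    const int ChunkSize = 1024 * 1024;   // 1 MB chunks

    public static void StoreFile(string fileKey, byte[] data, Action<string, byte[]> store)
    {
        var chunkKeys = new List<string>();
        for (int offset = 0, i = 0; offset < data.Length; offset += ChunkSize, i++)
        {
            int length = Math.Min(ChunkSize, data.Length - offset);
            var chunk = new byte[length];
            Array.Copy(data, offset, chunk, 0, length);

            string chunkKey = fileKey + ":chunk:" + i;
            store(chunkKey, chunk);
            chunkKeys.Add(chunkKey);
        }
        // The "manifest": the ordered list of chunk keys, stored under the file's own key.
        store(fileKey, Encoding.UTF8.GetBytes(string.Join("\n", chunkKeys)));
    }

    public static byte[] FetchFile(string fileKey, Func<string, byte[]> fetch)
    {
        var chunkKeys = Encoding.UTF8.GetString(fetch(fileKey)).Split('\n');
        using (var ms = new MemoryStream())
        {
            foreach (var chunkKey in chunkKeys)
            {
                var chunk = fetch(chunkKey);
                ms.Write(chunk, 0, chunk.Length);
            }
            return ms.ToArray();
        }
    }
}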
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)