Explaining if a SHA1 alternative is a good hash function - function

Split the message into 160-bit blocks. Leave the first block unchanged, shift the 2nd block 1 bit left, shift the 3rd block 2 bits left, and so on. The shift wraps bits around, i.e., 1011...00 shifted 2 bits left is 11...0010. Then xor the resulting blocks together to obtain a 160-bit digest.
Is this a good hash function?

Do you need cryptographic strength hash function? Are you planning to use it to:
Verify integrity of sensitive data (files, transactions, passwords, etc.)
Identify yourself, your data or your computations
Use it as part of another cryptographic or verification scheme
If so, use an existing and tested hash function. SHA-1 itself is regarded as vulnerable an unsuitable for secure applications and it's way more secure than your algorithm. Almost everyone (myself included) is dangerously bad at cryptography. If you haven't even tested your hash function, or don't know how to test a hash function, you aren't that good at crypto either.
Or can you get away with a normal hash function? Are you just using this for internal computations that don't get saved or stored? Then you might have a need to roll your own hash function. From your default to 160-bit digests and comparison to SHA-1, it doesn't sound like this is your use case. Most non-crypto hash functions roll out power-of-two digest lengths. Again, however, there are other open sourced and tested hash functions available, so why re-invent the wheel?
If you want to learn, I suggest you implement your hash function in code then put in the effort to test it. See this question for some primer info on this. There is an open source project called SMHasher that "is a test suite designed to test the distribution, collision, and performance properties of non-cryptographic hash functions - it aims to be the "DieHarder" of hash testing, and does a pretty good job of finding flaws with a number of popular hashes."

If you are interested in alternates of SHA-1 then look in to SHA-2 and SHA-3
As of now SHA-2 is most secure hashing algorithm. And about SHA-3 Wikipedia says:
SHA-3 is not meant to replace SHA-2, as no significant attack on SHA-2
has been demonstrated

Related

is there cryptographically secure hash algorithm/function that allows hashing faster when you concatenate more data?

I need to calculate hash codes for constantly growing dataset and it needs to be cryptographically secure if possible. For example hashing the series of natural numbers, first 1, then 12, then 123.
If cryptographically secure hashing always requires doing everything from the start for each added number, then I need some other hashing. The most secure cryptographically, or something that takes as long as possible to find other data that gives the same hash code.
RAM usage can be massive.
What you're looking for is tree hashing, which is a form of Merkle tree. Essentially, each data block is hashed, and a hash is computed over the hash of each's data block hash, and so on until you reach the root. Typically, there are dedicated designs for this that prevent the security problems with the immediate and naïve approach.
If you'd like to incrementally append more data, then you'll obviously need to store the hashes for the data blocks or at least some intermediate values so you can recompute the root hash efficiently. BLAKE2 has a special tree mode which can be used, although I'm not sure if any of the standard libraries support it, so you may need to take the reference code and configure it accordingly. BLAKE2 is cryptographically secure and extremely fast, even in plain C.
There's also the newer BLAKE3, which is supposed to be cryptographically secure and is even faster. It always runs in a tree hash mode. However, it has seen less cryptanalysis than BLAKE2, and so I would recommend BLAKE2 for almost all applications.
There are similar approaches for other hash functions, but these are the fastest cryptogpaphically secure options.

What is differentiable programming?

Native support for differential programming has been added to Swift for the Swift for Tensorflow project. Julia has similar with Zygote.
What exactly is differentiable programming?
what does it enable? Wikipedia says
the programs can be differentiated throughout
but what does that mean?
how would one use it (e.g. a simple example)?
and how does it relate to automatic differentiation (the two seem conflated a lot of the time)?
I like to think about this question in terms of user-facing features (differentiable programming) vs implementation details (automatic differentiation).
From a user's perspective:
"Differentiable programming" is APIs for differentiation. An example is a def gradient(f) higher-order function for computing the gradient of f. These APIs may be first-class language features, or implemented in and provided by libraries.
"Automatic differentiation" is an implementation detail for automatically computing derivative functions. There are many techniques (e.g. source code transformation, operator overloading) and multiple modes (e.g. forward-mode, reverse-mode).
Explained in code:
def f(x):
return x * x * x
∇f = gradient(f)
print(∇f(4)) # 48.0
# Using the `gradient` API:
# ▶ differentiable programming.
# How `gradient` works to compute the gradient of `f`:
# ▶ automatic differentiation.
I never heard the term "differentiable programming" before reading your question, but having used the concepts noted in your references, both from the side of creating code to solve a derivative with Symbolic differentiation and with Automatic differentiation and having written interpreters and compilers, to me this just means that they have made the ability to calculate the numeric value of the derivative of a function easier. I don't know if they made it a First-class citizen, but the new way doesn't require the use of a function/method call; it is done with syntax and the compiler/interpreter hides the translation into calls.
If you look at the Zygote example it clearly shows the use of prime notation
julia> f(10), f'(10)
Most seasoned programmers would guess what I just noted because there was not a research paper explaining it. In other words it is just that obvious.
Another way to think about it is that if you have ever tried to calculate a derivative in a programming language you know how hard it can be at times and then ask yourself why don't they (the language designers and programmers) just add it into the language. In these cases they did.
What surprises me is how long it to took before derivatives became available via syntax instead of calls, but if you have ever worked with scientific code or coded neural networks at at that level then you will understand why this is a concept that is being touted as something of value.
Also I would not view this as another programming paradigm, but I am sure it will be added to the list.
How does it relate to automatic differentiation (the two seem conflated a lot of the time)?
In both cases that you referenced, they use automatic differentiation to calculate the derivative instead of using symbolic differentiation. I do not view differentiable programming and automatic differentiation as being two distinct sets, but instead that differentiable programming has a means of being implemented and the way they chose was to use automatic differentiation, they could have chose symbolic differentiation or some other means.
It seems you are trying to read more into what differential programming is than it really is. It is not a new way of programming, but just a nice feature added for doing derivatives.
Perhaps if they named it differentiable syntax it might have been more clear. The use of the word programming gives it more panache than I think it deserves.
EDIT
After skimming Swift Differentiable Programming Mega-Proposal and trying to compare that with the Julia example using Zygote, I would have to modify the answer into parts that talk about Zygote and then switch gears to talk about Swift. They each took a different path, but the commonality and bottom line is that the languages know something about differentiation which makes the job of coding them easier and hopefully produces less errors.
About the Wikipedia quote that
the programs can be differentiated throughout
At first reading it seems nonsense or at least lacks enough detail to understand it in context which is why I am sure you asked.
In having many years of digging into what others are trying to communicate, one learns that unless the source has been peer reviewed to take it with a grain of salt, and unless it is absolutely necessary to understand, then just ignore it. In this case if you ignore the sentence most of what your reference makes sense. However I take it that you want an answer, so let's try and figure out what it means.
The key word that has me perplexed is throughout, but since you note the statement came from Wikipedia and in Wikipedia they give three references for the statement, a search of the word throughout appears only in one
∂P: A Differentiable Programming System to Bridge Machine Learning and Scientific Computing
Thus, since our ∂P system does not require primitives to handle new
types, this means that almost all functions and types defined
throughout the language are automatically supported by Zygote, and
users can easily accelerate specific functions as they deem necessary.
So my take on this is that by going back to the source, e.g. the paper, you can better understand how that percolated up into Wikipedia, but it seems that the meaning was lost along the way.
In this case if you really want to know the meaning of that statement you should ask on the Wikipedia talk page and ask the author of the statement directly.
Also note that the paper referenced is not peer reviewed, so the statements in there may not have any meaning amongst peers at present. As I said, I would just ignore it and get on with writing wonderful code.
You can guess its definition by application of differentiability.
It's been used for optimization i.e. to calculate minimum value or maximum value
Many of these problems can be solved by finding the appropriate function and then using techniques to find the maximum or the minimum value required.

Are cryptographic hashes injective under certain conditions?

sorry for the lengthy post, I have a question about common crpytographic hashing algorithms, such as the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the actual digest produced has usually a fixed length (e.g., 160 bits under SHA-1, etc.) whereas the space of possible messages to be digested is virtually infinite.
However, if we generate a digest of a message which is at most as long as the digest generated, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms, which are known to produce collisions even on messages whose bit lengths is shorter than the bit length of the digest produced?
I am actually looking for an algorithm, which has this property, i.e., which, at least in principle, may generate colliding hashes even for short input messages.
Background: We have a browser plug-in, which for every web-site visited, makes a server request asking, whether the web-site belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to generate some kind of surf history, we do not actually send the URL visited, but a hash digest (currently, SHA-1) of some cleaned up version. On the server side, we have a table of hashes of well-known URIs, which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.
For obvious reasons, this scheme is pretty fuzzy and admits false positives as well as URIs not matched which should have.
So right now, we are considering changing the fingerprint generation to something, which has more structure, for example, instead of hashing the full (cleaned up) URI, we might instead:
split the host name into components at "." and hash those individually
check the path into components at "." and hash those individually
Join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler 32 as hash) might yield "46989670.104268307.41353536/19857610/73204162".
However, as such a fingerprint has a lot of structure (in particular, when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" compont values, such as "www").
So right now, I am looking for a hash/digest algorithm, which has a high rate of collisions (Adler 32 is seriously considered) even on short messages, so that the probability of a given component hash being unique is low. We hope, that the additional structure we impose provides us with enough additional information to as to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).
I do not believe hashes are guaranteed to be injective for messages the same size the digest. If they were, they would be bijective, which would be missing the point of a hash. This suggests that they are not injective for messages smaller than the digest either.
If you want to encourage collisions, i suggest you use any hash function you like, then throw away bits until it collides enough.
For example, throwing away 159 bits of a SHA-1 hash will give you a pretty high collision rate. You might not want to throw that many away.
However, what you are trying to achieve seems inherently dubious. You want to be able to tell that the URL is one of your ones, but not which one it is. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, because collisions will be random, since there are many more URLs which are not yours than which are (i assume!), any given level of collision will lead to dramatically more confusion over whether a URL is one of yours or not than over which of yours it is.
Instead, how about sending the list of URLs to the plugin at startup, and then having it just send back a single bit indicating if it's visiting a URL in the list? If you don't want to send the URLs explicitly, send hashes (without trying to maximise collisions). If you want to save space, send a Bloom filter.
Since you're willing to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.
Each client downloads a Bloom filter containing the whole whitelist. Then the client has no need to otherwise communicate with the server, and there is no risk of spying.
At 2 bytes per URL, the false positive rate would be below 0.1%, and at 4 bytes per URL below 1 in 4 million.
Downloading the whole filter (and perhaps regular updates to it) is a large investment of bandwidth up front. But supposing that it has a million URLs on it (which seems quite a lot to me, given that you can probably apply some rules to canonicalize URLs before lookup), it's a 4MB download. Compare this with a list of a million 32 bit hashes: same size, but the false positive rate would be somewhere around 1 in 4 thousand, so the Bloom filter wins for compactness.
I don't know how the plugin contacts the server, but I doubt that you can get an HTTP transaction done in much under 1kB -- perhaps less with keep-alive connections. If filter updates are less frequent than one per 4k URL visits by a given user (or a smaller number if there are less than a million URLs, or greater than 1 in 4 million false positive probability), this has a chance of using less bandwidth than the current scheme, and of course leaks much less information about the user.
It doesn't work quite as well if you require new URLs to be whitelisted immediately, although I suppose the client could still hit the server at every page request, to check whether the filter has changed and if so download an update patch.
Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and RAM is limited), then I think you could still introduce some information-hiding by having the client compute which bits of the Bloom filter it needs to see, and asking for those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to ask for and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client asking for spurious bits it doesn't really need (hide the information in noise), I expect you could obfuscate what URLs you're visiting. It would take some analysis to prove how much that actually works, though: a spy would be aiming to find a pattern in your requests that's correlated with browsing a particular site.
I'm under the impression that you actually want public key cryptography, where you provide the visitor with a public key used to encode the URL, and you decrypt the URL using the secret key.
There are JavaScript implementations a bit everywhere.

Pseudo-random number generator

What is the best way to create the best pseudo-random number generator? (any language works)
Best way to create one is to not to.
Pseudo-random number generators are a very complex subject, so it's better off to use the implementations produced by the people that have a good understanding of the subject.
It all depends on the application. The generator that creates the "most random" numbers might not be the fastest or most memory-efficient one, for example.
The Mersenne Twister algorithm is a popular, fairly fast pseudo-random number generator that produces quite good results. It has a humongously large period, but also a relatively humongous state (2.5 kB). However it is not deemed good enough for cryptographic applications.
Update: Since this answer was written, the PCG family of algorithms was published that seems to outperform existing non-cryptographic algorithms on most fronts (speed, memory, randomness and period), making it an excellent all-round choice for anything but cryptography.
If you're doing crypto though, my answer remains: don't roll your own.
The German magazine C't tested a number of software and hardware generators in the 2/2009 issue and ran the results through various statistical tests.
I scanned the results here.
I would not bother writing my own. The article mentions that even Donald Knuth failed with his "Super-random number generator", which was not so random after all. Get one that passed all tests (had a result > 0 in all columns). They also tested a setup with a VIA EPIA M10000 mobo, which has a hardware RNG. I like this option for a commercial or semi-commercial setup that requires a robust random number server with high throughput.
Unless, of course, you are just playing around, in which case this may be good enough.
PRNG algorithms are complicated, as is acquiring the right sources of entropy to make them work well. This is not something you want to do yourself. Every modern language has a PRNG library that will almost certainly be suitable for your use.
Yikes, that can get VEEEEEERY complicated! There seem to be a number of metrics for how to measure the "randomness" of a random number generator, so it's difficult to meaure which are "best". I would start with Numerical Recipes in C (or whatever langauge you can find one for) for a few examples. I coded up my first simple one from the examples given there.
EDIT: It's also important to start by determining how complex you need your random number generator to be. I remember a rude awakening I had in C years ago when I discovered that the default random number generator had a period somewhere around 32,767, meaning that it tended to repeat itself periodically after generating that many numbers! If you need a few dice rolls, that's fine. But not when you need to generate millions of "random" values for a simulation.
See Pitfalls in Random Number Generation
See this link for the TestU01 suite of tests, which includes several batteries of tests.
http://www.iro.umontreal.ca/~simardr/testu01/tu01.html
In the paper, the author demonstrates the test result on a variety of existing RNGs, but not .NET System.Random (as far as I can tell). Though he does test VB6's generator.
Very few pass all the tests...
Steal the one out of knuth seminumeric.
It is high quality and simple to implement.
It uses a pair of arrays, addition, and a couple of ifs.
Cheap, effective, and a nice long period 2^55 if i recall correctly.
If you're going to work in C++, Boost has a collection of PRNGs that I'd trust a lot more than whatever comes in standard libraries. The documentation might be helpful in picking one out. As always, how good a PRNG is depends on what you're using it for.
https://github.com/fsssosei/Pure_PRNG
Python libraries for PRNG algorithms that pass statistical tests
In production code it is definitely safer to leverage an established library, but understanding how pseudorandom number generators work and writing your own can be fun and worthwhile from an educational standpoint.
There are many different techniques, but one straight forward approach is based on the logistic map. I.e.
x = r * x * (1 - x)
With the right value for r, values of x demonstrate chaotic and unpredictable behavior. There are pitfalls to be aware of which you can read about in the references below.
References
https://tim.cogan.dev/random-num/
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C44&q=Logistic+map%3A+A+possible+random-number+generator&btnG=
My favorites are Hardware random number generators

how are serial generators / cracks developed?

I mean, I always was wondered about how the hell somebody can develop algorithms to break/cheat the constraints of legal use in many shareware programs out there.
Just for curiosity.
Apart from being illegal, it's a very complex task.
Speaking just at a teoretical level the common way is to disassemble the program to crack and try to find where the key or the serialcode is checked.
Easier said than done since any serious protection scheme will check values in multiple places and also will derive critical information from the serial key for later use so that when you think you guessed it, the program will crash.
To create a crack you have to identify all the points where a check is done and modify the assembly code appropriately (often inverting a conditional jump or storing costants into memory locations).
To create a keygen you have to understand the algorithm and write a program to re-do the exact same calculation (I remember an old version of MS Office whose serial had a very simple rule, the sum of the digit should have been a multiple of 7, so writing the keygen was rather trivial).
Both activities requires you to follow the execution of the application into a debugger and try to figure out what's happening. And you need to know the low level API of your Operating System.
Some heavily protected application have the code encrypted so that the file can't be disassembled. It is decrypted when loaded into memory but then they refuse to start if they detect that an in-memory debugger has started,
In essence it's something that requires a very deep knowledge, ingenuity and a lot of time! Oh, did I mention that is illegal in most countries?
If you want to know more, Google for the +ORC Cracking Tutorials they are very old and probably useless nowdays but will give you a good idea of what it means.
Anyway, a very good reason to know all this is if you want to write your own protection scheme.
The bad guys search for the key-check code using a disassembler. This is relative easy if you know how to do this.
Afterwards you translate the key-checking code to C or another language (this step is optional). Reversing the process of key-checking gives you a key-generator.
If you know assembler it takes roughly a weekend to learn how to do this. I've done it just some years ago (never released anything though. It was just research for my game-development job. To write a hard to crack key you have to understand how people approach cracking).
Nils's post deals with key generators. For cracks, usually you find a branch point and invert (or remove the condition) the logic. For example, you'll test to see if the software is registered, and the test may return zero if so, and then jump accordingly. You can change the "jump if equals zero (je)" to "jump if not-equals zero (jne)" by modifying a single byte. Or you can write no-operations over various portions of the code that do things that you don't want to do.
Compiled programs can be disassembled and with enough time, determined people can develop binary patches. A crack is simply a binary patch to get the program to behave differently.
First, most copy-protection schemes aren't terribly well advanced, which is why you don't see a lot of people rolling their own these days.
There are a few methods used to do this. You can step through the code in a debugger, which does generally require a decent knowledge of assembly. Using that you can get an idea of where in the program copy protection/keygen methods are called. With that, you can use a disassembler like IDA Pro to analyze the code more closely and try to understand what is going on, and how you can bypass it. I've cracked time-limited Betas before by inserting NOOP instructions over the date-check.
It really just comes down to a good understanding of software and a basic understanding of assembly. Hak5 did a two-part series on the first two episodes this season on kind of the basics of reverse engineering and cracking. It's really basic, but it's probably exactly what you're looking for.
A would-be cracker disassembles the program and looks for the "copy protection" bits, specifically for the algorithm that determines if a serial number is valid. From that code, you can often see what pattern of bits is required to unlock the functionality, and then write a generator to create numbers with those patterns.
Another alternative is to look for functions that return "true" if the serial number is valid and "false" if it's not, then develop a binary patch so that the function always returns "true".
Everything else is largely a variant on those two ideas. Copy protection is always breakable by definition - at some point you have to end up with executable code or the processor couldn't run it.
The serial number you can just extract the algorithm and start throwing "Guesses" at it and look for a positive response. Computers are powerful, usually only takes a little while before it starts spitting out hits.
As for hacking, I used to be able to step through programs at a high level and look for a point where it stopped working. Then you go back to the last "Call" that succeeded and step into it, then repeat. Back then, the copy protection was usually writing to the disk and seeing if a subsequent read succeeded (If so, the copy protection failed because they used to burn part of the floppy with a laser so it couldn't be written to).
Then it was just a matter of finding the right call and hardcoding the correct return value from that call.
I'm sure it's still similar, but they go through a lot of effort to hide the location of the call. Last one I tried I gave up because it kept loading code over the code I was single-stepping through, and I'm sure it's gotten lots more complicated since then.
I wonder why they don't just distribute personalized binaries, where the name of the owner is stored somewhere (encrypted and obfuscated) in the binary or better distributed over the whole binary.. AFAIK Apple is doing this with the Music files from the iTunes store, however there it's far too easy, to remove the name from the files.
I assume each crack is different, but I would guess in most cases somebody spends
a lot of time in the debugger tracing the application in question.
The serial generator takes that one step further by analyzing the algorithm that
checks the serial number for validity and reverse engineers it.