I just noticed something about MySQL that I hadn't noticed before.
If you use any of the MySQL hashing functions (MD5, SHAx, PASSWORD, etc.), they all seem to return the same hash for a given input. This happens across every schema and database, regardless of which instance they are installed on.
I have a local MySQL server database and two others hosted with different internet providers.
If I do the following
select MD5('Password');
select Sha1('Password');
select Sha2('Password', 224);
select Password('Password');
each function yields the same result, for that function, across all of my MySQL instances.
For example, select MD5('Password') gives the same dc647eb............12b3964 hash on every one of my servers. This looks a little suspect to me and sounds like a security hole.
Has anyone noticed this and is there anything that can be done about it?
MD5, SHA1 and SHA2 are simple cryptographic hashes that for any given input will, by design, produce exactly the same output. This is how they are intended to be used. You don't want the SHA2 file signature of something to come out differently each time you run the hash. They're also designed to be fast to compute.
You want things like SHA2(x) to always produce the same output for any given x so that if you have a file and a signature you can see if the file has been in any way tampered with by computing the hash and comparing it.
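To see the same determinism outside MySQL, here is a minimal sketch using nothing but Python's standard library (nothing MySQL-specific is assumed):
import hashlib

# The same input always produces the same digest, on any machine, in any implementation.
print(hashlib.md5(b"Password").hexdigest())     # dc647eb65e6711e155375218212b3964, as reported in the question
print(hashlib.sha256(b"Password").hexdigest() ==
      hashlib.sha256(b"Password").hexdigest())  # True, by design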
Password-specific hashes like bcrypt, which you might be thinking of, work differently: they mix in a random salt, so the same input produces a different-looking stored value each time. This makes them far more resistant to brute-force password-guessing attacks. They're also designed to be slow, often thousands if not millions of times slower than their MD5 or SHA counterparts.
You want, effectively, BCRYPT(x) to look random and unpredictable for any given x, so that you cannot infer x from the output or precompute a table of results.
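A rough sketch using the third-party Python bcrypt package (my assumption for the example; any bcrypt binding behaves the same way, and this would run in application code rather than in MySQL):
import bcrypt

password = b"Password"
# Each call generates a fresh random salt, so the stored values differ for the same input.
hash1 = bcrypt.hashpw(password, bcrypt.gensalt())
hash2 = bcrypt.hashpw(password, bcrypt.gensalt())
print(hash1 != hash2)                    # True: two different stored hashes for the same password
# Verification reuses the salt and cost factor embedded in the stored hash.
print(bcrypt.checkpw(password, hash1))   # True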
Yes, using MD5 or SHA for passwords is a huge security problem, especially if the input is unsalted. Just search for dc647eb65e6711e155375218212b3964 in your favorite search engine and see what comes up: it's instantly "dehashed". You can effectively use a search engine as what used to be termed a rainbow table.
SHA and MD5 were used extensively for hashing passwords, mostly because they were the best options available at the time. Computers were also far, far slower, and GPU cracking didn't exist, so the risk of compromise was vastly reduced. Now tools like Hashcat exist that can crack even "difficult" passwords if someone is careless enough to use a weak hash.
Related
I have to encrypt several pieces of data in my DB (not passwords).
I would like to use a salt to prevent rainbow table attacks.
I'm generating a salt like this:
mysalt = UNHEX(SHA2(RAND(),512))
Is RAND() (the MySQL function) a sufficient source of entropy? All of my salt values should differ from one another, but that won't be the case if my PRNG produces too many collisions. Does it depend on the number of records in the DB? If so, what is the limit with RAND()? What would be a good alternative technique if the one above isn't good? Finally, is it also good to salt passwords this way?
I have to encrypt several pieces of data in my DB (not passwords). I would like to use a salt to prevent rainbow table attacks.
As already commented, you should use proper terminology to make sure we're talking about the same thing. Based on the comments, let's assume you're generating an IV (initialization vector) for a block cipher mode (CBC, CFB or OFB). The IV is what enables safe key reuse. It has nothing to do with rainbow tables.
mysalt = UNHEX(SHA2(RAND(),512))
RAND() generates a float between 0 and 1. Your expression effectively uses that float as a string, so written out explicitly it would be something like SHA2(CONVERT(RAND(), CHAR), 224) (224 is the shortest output MySQL's SHA2() supports). AES uses a 128-bit IV anyway, so there is no reason to generate more than 16 bytes.
Is RAND() (the MySQL function) a sufficient source of entropy?
It depends on the mode used. For some modes (CTR, OFB) the IV is only required to be unique (even a simple counter is good enough); for the CBC and CFB modes the IV needs to be "unpredictable" (a stronger requirement than merely random).
I am unable to find how the state of the RAND() function is seeded, so I cannot guarantee that it cannot be brute-forced (it can be in the case of a timestamp-seeded NLFSR PRNG). I'm not really willing to dig into the MySQL source code (as far as I recall, MySQL uses the OS rand function, so it will depend on the underlying system; I may be wrong here). So RAND() may be safe for generating the IV for CBC, under the assumption that its initial state is random enough; no one here seems able to confirm or deny that assumption. RANDOM_BYTES(), on the other hand, is guaranteed to use a cryptographically secure random source.
Does it depend on the number of records in the DB?
If so, what is the limit with RAND()?
You need a unique key-IV combination. The more records you have, the higher the probability of a collision. And here we come back to the initial RAND() state size, which is system-dependent, though in the comments someone claims it's 4 bytes. That's really not very much.
What would be a good alternative technique if the one above isn't good?
Finally, is it also good to salt passwords this way?
As already commented, RANDOM_BYTES() uses the random generator from the SSL library, which is required to be cryptographically secure.
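If you generate the IV in application code instead of in SQL, the analogue of RANDOM_BYTES() is the operating system's CSPRNG. A minimal Python sketch of that idea (nothing MySQL-specific is assumed):
import os

iv = os.urandom(16)   # 16 cryptographically secure random bytes = a 128-bit IV for AES-CBC
# Store the IV alongside the ciphertext; it does not need to be secret, only unpredictable.
print(iv.hex())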
I have never used apc_store() before, and I'm also not sure about whether to free query results or not. So I have these questions...
In a MySQL Query Cache article here, it says "The MySQL query cache is a global one shared among the sessions. It caches the select query along with the result set, which enables the identical selects to execute faster as the data fetches from the in memory."
Does using free_result() after a select query negate the caching spoken of above?
Also, if I want to set variables and arrays obtained from the select query for use across pages, should I save the variables in memory via apc_store() for example? (I know that can save arrays too.) And if I do that, does it matter if I free the result of the query? Right now, I am setting these variables and arrays in an included file on most pages, since they are used often. This doesn't seem very efficient, which is why I'm looking for an alternative.
Thanks for any help/advice on the most efficient way to do the above.
MySQL's "Query cache" is internal to MySQL. You still have to perform the SELECT; the result may come back faster if the QC is enabled and usable in the situation.
I don't think the QC is what you are looking for.
The QC is going away in newer versions. Do not plan to use it.
In PHP, consider $_SESSION. I don't know whether it is better than apc_store for your use.
Note also, anything that is directly available in PHP constrains you to a single webserver. (This is fine for small to medium apps, but is not viable for very active apps.)
For scaling, consider storing a small key in a cookie, then looking up that key in a table in the database. This provides for storing arbitrary amounts of data in the database with only a few milliseconds of overhead. The "key" might be something as simple as a "user id" or "session number" or "cart number", etc.
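A rough sketch of that pattern, in Python only because it is compact (the same idea works in PHP; the table and column names are made up, and cur is assumed to be a DB-API cursor from a MySQL driver such as PyMySQL):
import secrets

def new_session(cur):
    # Generate an unguessable key, store a row for it, and put the key in a cookie.
    key = secrets.token_hex(16)
    cur.execute("INSERT INTO sessions (session_key, data) VALUES (%s, %s)", (key, "{}"))
    return key

def load_session(cur, key):
    # Per-request lookup by the cookie key: only a few milliseconds of overhead.
    cur.execute("SELECT data FROM sessions WHERE session_key = %s", (key,))
    return cur.fetchone()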
I need to store user agent strings in a database for tracking and comparing customer behavior and sales performance between different browsers. A fairly plain user agent string is around 100 characters long. It was decided to use a varchar(1024) to hold the user agent data in the database. (I know this is overkill, but that's the idea; it's supposed to accommodate user agent data for years to come, and some devices, toolbars, and applications are already pushing 500 characters in length.) The table holding these strings will be normalized (each distinct user agent string will only be stored once) and treated like a cache so that we don't have to interpret user agents over and over again.
The typical use case is:
User comes to our site, is detected as being a new visitor
New session information is created for this user
Determine if we need to analyze the user agent string or if we have a valid analysis on file for it
If we have it, great; if not, analyze it (currently, we plan on calling a 3rd-party API)
Store the pertinent information (browser name, version, OS, etc.) in a join table tied to the existing user session information and pointing to the cache entry
Note: I have a tendency to say 'searching' for the user agent string in the database because it's not a simple look up. But to be clear, the queries are going to use '=' operators, not regexes or LIKE % syntax.
So the speed of looking up the user agent string is paramount. I've explored a few methods of making sure it will have good performance. Indexing the whole column is right out for size reasons. A partial index isn't such a good idea either because most user agents have the distinguishing information at the end; the partial index would have to be fairly long to make it worthwhile by which point its size is causing problems.
So it comes down to a hash function. My thought is to hash the user agent string in web server code and run the select looking for the hash value in the database. I feel like this would minimize the load on the database server (as opposed to having it compute the hash), especially since if the hash isn't found, the code would turn around and ask the database to compute the hash again on the insert.
Hashing to an integer value would offer the best performance, at the risk of more collisions. I'm expecting to see thousands or tens of thousands of user agents at most; even 100,000 user agents would fit reasonably well into a 2^32-sized integer space with very few collisions, which the web service could resolve with minimal performance impact. Even if you think an integer hash isn't such a good idea, using a 32- or 40-character digest (MD5 or SHA-1, for example) should be much faster for selects than the raw string, right?
My database is MySQL InnoDB engine. The web code will be coming from C# at first and php later (after we consolidate some hosting and authentication) (not that the web code should make a big difference).
Let me apologize at this point if you think this is a lame choose-my-hash-algorithm question. I'm really hoping to get some input from people who've done something similar before and hear about their decision process. So, the questions:
Which hash would you use for this application?
Would you compute the hash in code or let the db handle it?
Is there a radically different approach for storing/searching long strings in a database?
Your idea of hashing long strings to create a token to look up within a store (cache or database) is a good one. I have seen this done for extremely large strings, and in high-volume environments, and it works great.
"Which hash would you use for this application?"
I don't think the particular hashing algorithm really matters here, as you are not hashing to protect data; you are hashing to create a token to use as a key for looking up longer values. So the choice of hashing algorithm should be based on speed.
"Would you compute the hash in code or let the db handle it?"
If it were my project, I would do the hashing at the app layer and then pass the hash through to look it up in the store (cache first, then database).
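For example, something along these lines at the app layer (Python just to illustrate the idea; the C# and PHP equivalents are one-liners as well):
import hashlib

def agent_key(user_agent):
    # Hash in application code; the database then does a plain '=' lookup on a short, fixed-width key.
    return hashlib.md5(user_agent.encode("utf-8")).hexdigest()   # 32 hex characters

def agent_key_int(user_agent):
    # Integer variant if you prefer a numeric key (faster index, but a higher collision risk).
    digest = hashlib.md5(user_agent.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")                     # fits an unsigned 32-bit column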
"Is there a radically different approach for storing/searching long strings in a database?"
As I mentioned, I think for your specific purpose, your proposed solution is a good one.
Table recommendations (demonstrative only):
user
id int(11) unsigned not null
name_first varchar(100) not null
user_agent_history
user_id int(11) unsigned not null
agent_hash varchar(255) not null
agent
agent_hash varchar(255) not null
browser varchar(100) not null
agent text not null
A few notes on the schema:
From your OP it sounds like you need a M:M relationship between user and agent, due to the fact that a user may be using Firefox from work, but then may switch to IE9 at home. Hence the need for the pivot table.
The varchar(255) used for agent_hash is up for debate. MySQL suggests using a binary or varbinary column type for storing hash values.
I would also suggest either making agent_hash a primary key, or at the very least, adding a UNIQUE constraint to the column.
Your hash idea is sound. I've actually used hashing to speed up some searches on millions of records. A hash index will be quicker since each entry is the same size. MD5 will likely be fine in your case and will probably give you the shortest hash length. If you are worried about hash collisions, you can also include the length of the agent string.
I have a system that needs to schedule some stuff and return identifiers to the scheduled tasks to some foreign objects. The user would basically do this:
identifier = MyLib.Schedule(something)
# Nah, let's unschedule it.
MyLib.Unschedule(identifier)
I use this kind of pattern a lot in internal code, and I always use plain integers as the identifier. But if the identifiers are used by untrusted code, a malicious user could break the entire system by doing a single Unschedule(randint()).
I need the users of the code to be able to only unschedule identifiers they have actually scheduled.
The only solution I can think of is to generate, e.g., 64-bit random numbers as identifiers and keep track of which identifiers are currently handed out, to avoid the ridiculously unlikely duplicates. Or 128-bit? When can I say "this is random enough, no duplicates could possibly occur", if ever?
Or better yet, is there a more sensible way to do this? Is there a way to generate identifier tokens that the generator can easily keep track of (avoiding duplicates) but is indistinguishable from random numbers to the recipient?
EDIT - Solution based on the accepted answer:
from Crypto.Cipher import AES
import struct, os, itertools
class AES_UniqueIdentifier(object):
    def __init__(self):
        self.salt = os.urandom(8)
        self.count = itertools.count(0)
        self.cipher = AES.new(os.urandom(16), AES.MODE_ECB)

    def Generate(self):
        return self.cipher.encrypt(self.salt +
                                   struct.pack("Q", next(self.count)))

    def Verify(self, identifier):
        "Return true if identifier was generated by this object."
        return self.cipher.decrypt(identifier)[0:8] == self.salt
Depending on how many active IDs you have, 64 bits can be too little. By the birthday paradox, you'd end up with essentially the level of protection you might expect from 32 bit identifiers.
Besides, probably the best way to create these is to use some salted hash function, such as SHA-1 or MD5 or whatever your framework already has, with a randomly chosen salt (kept secret), and those generate at least 128 bits anyway, exactly for the reason mentioned above. If you use something that creates longer hash values, I don't really see any reason to truncate them.
To create identifiers you can check without storing them, take something easy to detect, such as having the same 64 bit patterns twice (giving a total of 128 bits) and encrypt that with some constant secret key, using AES or some other cipher with a block size of 128 bits (or whatever you picked). If and when the user sends some alleged key, decrypt and check for your easy-to-spot pattern.
It sounds to me like you might be overthinking this problem. This sounds 100% like an application for a GUID/UUID. Python even has a built-in way to generate them. The whole point of GUIDs/UUIDs is that the odds of collision are astronomical, and by using a string instead of an encrypted token you can skip the decryption operation in the verify step. I think this would also eliminate a whole slew of problems you might encounter regarding key management, and increase the speed of the whole process.
EDIT:
With a UUID, your verify method would just be a comparison between the given UUID and the stored one. Since the odds of a collision between two UUIDs are incredibly low, you shouldn't have to worry about false positives. In your example, it appears that the same object is doing both encryption and decryption, without a third party reading the stored data. If this is the case, you aren't gaining anything by passing around encrypted data except that the bits you're passing around aren't easy to guess. I think a UUID would give you the same benefit without the overhead of the encryption operations.
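For instance, with nothing but the standard library (a sketch; the in-memory set here is just a stand-in for whatever store you actually use):
import uuid

scheduled = set()   # stand-in for wherever you keep the active identifiers

def schedule(something):
    identifier = str(uuid.uuid4())   # 122 random bits, formatted as a hex string
    scheduled.add(identifier)
    return identifier

def unschedule(identifier):
    # Verification is just a membership check: no decryption step needed.
    if identifier in scheduled:
        scheduled.remove(identifier)
        return True
    return False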
You make your identifier long enough that it can't reasonably be guessed. In addition, have Unschedule wait for one second if the token is not in use, so that a brute-force attack is no longer feasible. As the other answer said, session IDs in web applications are exactly the same problem, and I have already seen session IDs that were 64 random characters long.
This is the same problem as dealing with session identifiers in ordinary web applications. Predictable session ids can easily lead to session hijacking.
Have a look at how session IDs are generated. Here is the content of a typical PHPSESSID cookie:
bf597801be237aa8531058dab94a08a9
If you want to be dead sure no brute-force attack is feasible, do the calculation backwards: How many attempts can a cracker make per second? How many different unique IDs are in use at any given point in time? How many IDs are there in total? How long would it take for the cracker to cover, say, 1% of the total ID space? Adjust the number of bits accordingly.
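A back-of-the-envelope version of that calculation (all the numbers below are assumptions for illustration only):
# Assumed threat model: adjust these figures to your own situation.
guesses_per_second = 10**9        # attacker's request rate
active_ids = 10**6                # identifiers valid at any given moment
bits = 128                        # size of each random identifier

total_ids = 2**bits
# Expected number of guesses before stumbling on *some* currently valid id:
expected_guesses = total_ids / active_ids
years = expected_guesses / guesses_per_second / (3600 * 24 * 365)
print("%.3e years to hit a valid id by brute force" % years)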
Do you need this pattern in a distributed or local environment?
If you're local, most OO languages support the notion of object identity, so if you need an opaque handle, just create a new object:
handle = new Object(); // in Java
No other client can fake this.
If you need to use this in distributed environments, you could keep a pool of handles per session, so that a foreign session can never use a stolen handle.
Is taking an MD5 sum still suitable for checking for file dupes? I know that it isn't secure, but does that really matter when trying to find file dupes?
Should I be using something in the SHA family instead?
What is best practice in this use case?
In this particular case, choice of algorithm probably isn't that significant. The key reasons for using SHA1 over MD5 all relate to creating cryptographically secure signatures.
MD5 should be perfectly acceptable for this task, as you probably don't need to worry about people maliciously crafting files to generate false duplicates.
If you care about performance, I think it would be better to check for a matching file size first, then use a fast hash function (CRC32 or MD5, which should be faster than SHA1), and only for the possible duplicates found this way confirm with MD5, SHA1 or SHA256 (depending on the criticality of the task).
SHA1 is slightly better as a checksum than MD5. It is what Git uses.
MD5 has known vulnerabilities at this point, but that may not be a problem for your application. It's still reasonably good for distinguishing piles of bits. If something comes up with no match, then you know you haven't already seen it, since the algorithm is deterministic. If something comes back as a match, you should actually compare it to the blob that it ostensibly matched before acting as if it's really a duplicate. MD5 is relatively fast, but if you can't afford full-text comparisons on hash collisions, you should probably use a stronger hash, like SHA-256.
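A sketch of that workflow (the helper names are made up):
import filecmp, hashlib

def file_md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files don't have to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(path, seen):
    # seen maps digest -> path of a previously recorded file.
    digest = file_md5(path)
    match = seen.get(digest)
    if match is None:
        seen[digest] = path          # no match: we have definitely not seen this content before
        return False
    # Hash matched: confirm byte-for-byte before treating it as a real duplicate.
    return filecmp.cmp(path, match, shallow=False)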
For the described purpose there is no real preference; either hash function will solve the problem. That said, MD5 will usually be slightly faster than SHA1.
Example in Python:
#!/usr/bin/env python
import hashlib, cProfile

def repeat(f, loops=10000000):
    def wrapper():
        for i in range(loops):
            f()
    return wrapper

@repeat
def test_md5():
    md5 = hashlib.md5(); md5.update(b"hello"); md5.hexdigest()

@repeat
def test_sha1():
    sha = hashlib.sha1(); sha.update(b"hello"); sha.hexdigest()

cProfile.run('test_md5()')
cProfile.run('test_sha1()')
#
# 40000004 function calls in 59.841 CPU seconds   (md5)
#
# ....
#
# 40000004 function calls in 65.346 CPU seconds   (sha1)
#
# ....
What you are talking about is a checksum, which is related to (but not the same as) a cryptographic hash.
Yes, both MD5 and even CRC work just fine as checksums, as long as you are not concerned with a malicious user intentionally crafting two different files with the same checksum. If that is a concern, use SHA1 or, even better, some cryptographically unbroken hash.
While MD5 does have known collisions, I've always used it for files and it's worked just fine.
We use MD5 at my work for exactly what you're considering. It works great. We only need to detect duplicate uploads on a per-customer basis, which reduces our exposure to the birthday problem, but MD5 would still be sufficient for us if we had to detect duplicates across all uploads rather than per customer. If you can believe the internet, the probability p of a collision given n samples and a hash size of b is bounded by:
p <= n (n - 1) / (2 * 2 ^ b)
A few years back I ran this calculation for n = 10^9 and b = 128 and came up with p <= 1.469E-21. To put that in perspective, 10^9 files is one per second for 32 years. So we don't fall back to comparing the files when the hashes match: if MD5 says two uploads are the same, they're the same.
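Plugging those numbers back into the bound above (a quick check in Python):
n = 10**9                    # uploads
b = 128                      # MD5 digest size in bits
p = n * (n - 1) / (2 * 2**b)
print(p)                     # ~1.469e-21, the figure quoted above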