MySQL: SELECTing by hash: is this possible?

I don't think it makes much sense. Still, this way you could keep the real static value out of the .php file and store only its hash there for the MySQL query. The source of the PHP file can't be reached from a user's machine, but you have to make backups of your files, and the static value is in them. Selecting by the hash of the column would solve this problem, I believe.
But I didn't find any examples or documentation saying that it's possible to use such functions in queries (not on the values in SQL queries, but on the columns being selected).
Is this possible?

An extremely slow query that simply selects all rows with an empty "column".
SELECT * FROM table WHERE MD5(column) = 'd41d8cd98f00b204e9800998ecf8427e'
If you're doing a lot of these queries, consider saving the MD5 hash in a column or index. Even better would be to do all MD5 calculations on the script's end - the day you're going to need an extra server for your project you'll notice that webservers scale a lot better than database servers. (That's something to worry about in the future, of course)
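A minimal sketch of the "save the hash in a column and compute it on the script's end" idea. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table name items, the column value_md5 and the helper names are made up for illustration:

```python
import hashlib
import sqlite3


def md5_hex(text: str) -> str:
    """Hash on the script side, so the database never has to call MD5()."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " value TEXT NOT NULL,"
    " value_md5 TEXT NOT NULL)"
)
# The index on the stored hash is what lets the lookup avoid a full scan.
conn.execute("CREATE INDEX idx_items_value_md5 ON items (value_md5)")


def insert_item(value: str) -> None:
    """Store the plain value together with its precomputed hash."""
    conn.execute(
        "INSERT INTO items (value, value_md5) VALUES (?, ?)",
        (value, md5_hex(value)),
    )


def find_by_hash(digest: str) -> list:
    """Compare the stored hash directly; no function is applied to the column."""
    rows = conn.execute(
        "SELECT value FROM items WHERE value_md5 = ?", (digest,)
    ).fetchall()
    return [row[0] for row in rows]


for v in ("", "alpha", "beta"):
    insert_item(v)

# The hash of the empty string from the query above.
empty_rows = find_by_hash("d41d8cd98f00b204e9800998ecf8427e")
```

The hash has to be kept in sync whenever value changes, which is the price of the faster lookup.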

It should be noted that setting up your system this way won't actually solve any problem in your particular case. You are not making your system more secure doing this, you are just making it more convoluted.
The standard way to hide secret values from the source base is to store these secret values in a separate file, and never submit that file to source control or make a backup of it. Load the value of the secret by using PHP code and then work with the value directly in MySQL (one way to do this is to store a "config.php" file or something along those lines that just sets variables/constants, and then just php-include the file).
That said, I'll answer your question anyway.
MySQL actually has a wide variety of hashing and encryption functions. See http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html
Since you tagged your question md5 I'm assuming the function you're looking for is MD5: http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html#function_md5
You select it just like this:
SELECT MD5(column) AS hashed_column FROM table
Then the value to compare to the hash will be in the column alias 'hashed_column'.
Or to select a particular row based on the hash:
SELECT * FROM table WHERE MD5(column) = 'hashed-value-to-compare'

If I understand correctly, you want to use a hash as a primary key:
INSERT INTO MyTable (pk) VALUES (MD5('plain-value'));
Then you want to retrieve it by hash without knowing what its hash digest is:
SELECT * FROM MyTable WHERE pk = MD5('plain-value');
Somehow this is supposed to provide greater security in case people steal a backup of your database and PHP code? Well, it doesn't. If I know the original plain-value and the method of hashing, I can find the data just as easily as if you didn't hash the value.
I agree with the comment from @scunliffe -- we're not sure exactly what problem you're trying to solve, but it sounds like this method will not solve it.
It's also inefficient to use an MD5 hash digest as a primary key. You have to store it in a CHAR(32), or else UNHEX it and store it in BINARY(16). Regardless, you can't use INT or even BIGINT as the primary key datatype. The key values are more bulky, and therefore make larger indexes.
Also new rows will insert in an arbitrary location in the clustered index. That's more expensive than adding new values to the end of the B-tree, as you would do if you used simple auto-incrementing integers like everyone else.
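To make the size argument concrete, here is a small Python sketch (not MySQL itself) showing the two storage forms of an MD5 digest. bytes.fromhex plays the role of UNHEX here:

```python
import hashlib

# Hex form: what you would store in CHAR(32).
hex_digest = hashlib.md5(b"plain-value").hexdigest()

# Raw form: what UNHEX() would give you for a BINARY(16) column.
raw_digest = bytes.fromhex(hex_digest)

print(len(hex_digest))  # 32 characters
print(len(raw_digest))  # 16 bytes
```

Either way the key is wider than the 4 bytes of an INT or the 8 bytes of a BIGINT.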

Related

Dedicated SQL table containing only unique strings

I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but other things potentially too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!
I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not minimal. As the database grows, there are tens of millions of URIs. So, just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in Hadoop, so the database table was, in essence, just a copy of a Hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. And an index solution would work well up to your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know if MySQL or SQL Server recognize this, although columnar databases (such as Vertica) should.
SQL Server has another option. If you declare the string as VARCHAR(max), then it is stored on a separate data page from the rest of the data. During a full table scan, there is no need to load the additional page in memory if the column is not being referenced in the query.
This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table that it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table. You can include more than just a string on the lookup table, other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
  Date
  Time
  IP Address
  Browser_ID
Browser Table:
  Browser_ID
  Browser_Name
  Browser_Version
  Browser_Properties
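The lookup-table layout above can be sketched roughly like this. It is a Python/sqlite3 stand-in rather than MySQL, with only a couple of the columns, and the helper names browser_id_for and log_request are invented for the example. INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE browser (
        browser_id   INTEGER PRIMARY KEY,
        browser_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE request (
        request_id INTEGER PRIMARY KEY,
        ip_address TEXT NOT NULL,
        browser_id INTEGER NOT NULL REFERENCES browser (browser_id)
    );
    """
)


def browser_id_for(name: str) -> int:
    """Return the id for a browser name, creating the lookup row if needed."""
    conn.execute("INSERT OR IGNORE INTO browser (browser_name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT browser_id FROM browser WHERE browser_name = ?", (name,)
    ).fetchone()
    return row[0]


def log_request(ip_address: str, browser_name: str) -> None:
    """Write one request row that stores only the small integer key."""
    conn.execute(
        "INSERT INTO request (ip_address, browser_id) VALUES (?, ?)",
        (ip_address, browser_id_for(browser_name)),
    )


log_request("10.0.0.1", "Firefox")
log_request("10.0.0.2", "Firefox")
log_request("10.0.0.3", "Chrome")

# Reading the "real" string back is a single join.
rows = conn.execute(
    "SELECT r.ip_address, b.browser_name"
    " FROM request r JOIN browser b ON b.browser_id = r.browser_id"
    " ORDER BY r.request_id"
).fetchall()
```

Two requests with the same browser share one row in the lookup table, which is where the space saving comes from.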
If you are planning on logging data in real time (as opposed to a batch job), then you want to ensure that the time to write a record to the database is as short as possible. If you are logging synchronously, then obviously the record creation time will directly affect the time it takes for an HTTP request to complete. If this is async, then slow record creation times will lead to a bottleneck. However, if this is a batch job, then performance will not matter so long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record, you really want to flatten out your database structure. Your current query in pseudocode might look like:
SELECT @id = id FROM PagesTable
WHERE PageName = @RequestedPageName
IF @id IS NULL
THEN
INSERT @RequestedPageName INTO PagesTable
@id = SELECT @@IDENTITY 'or whatever method your db supports for
'fetching the id of a newly created record
END IF
INSERT @id, @BrowserName INTO BrowsersLogTable
Whereas in a flat structure you would need just one INSERT.
If you are concerned about data integrity, which you should be, then typically you would normalise this data by writing it into a separate set of tables (or a separate database) at regular intervals and use those for querying.

MySQL - Working with Encrypted Columns

I have some tables with encrypted fields. After looking through the MySQL docs, I found that you can't create a custom datatype for encryption / decryption, which would be ideal. So, instead, I have a view similar to the one below:
CREATE VIEW EMPLOYEE AS
SELECT ID, FIRST_NAME, LAST_NAME, SUPER_SECURE_DECRYPT(SSN) AS SSN
FROM EMPLOYEE_ENCRYPTED
Again, after reading through the MySQL documentation, I've learned that the view isn't insertable because it has a derived column and the SSN field isn't updatable since it's a derived column, which makes sense. However, you can't add a trigger to a view so writing to the view is not really an option.
In an attempt to get around this, I've created a couple of triggers similar to:
CREATE TRIGGER EMPLOYEE_ENCRYPTED_UPDATE
BEFORE UPDATE ON EMPLOYEE_ENCRYPTED FOR EACH ROW
BEGIN
IF NEW.SSN <> OLD.SSN THEN
SET NEW.SSN = SUPER_SECURE_ENCRYPT(NEW.SSN);
END IF;
END;
as well as one for inserting (which, since it's so similar, I'm not going to post it). This simply means I have to read from the view and write to the table.
This is a decent solution, except when you supply a WHERE clause for the UPDATE statement that queries the encrypted column (as in, update an employee by their SSN). Typically this isn't an issue, since I normally use the primary key for updates, but I need to know for other encrypted fields whether there's a way to do this.
I want to make MySQL do the heavy lifting for encryption and decryption so that it can be as frictionless as possible to work with as a developer. I would like the application developer to not have to worry about encrypted fields as much as possible while still using encrypted fields, that's the ultimate goal here. Any help or advice is appreciated.
It is difficult to answer your question without knowing the type of encryption you are using. If it's a standard encryption/hashing function such as MD5, you can use that directly in MySQL with a WHERE ssn=MD5('ssnStr') type of clause, though MD5 is a one-way hash and can't be decrypted. Otherwise, if it's some sort of customized encryption, you'll have to either
1) create a procedure that performs the encryption/decryption and use that in your WHERE clause
or
2) perform the encryption before hand and use its result to match the condition you desire in your WHERE clause or wherever in your query.
It may be best to supply your query with the encrypted value for the SSN and use that to match to your field. If you have to perform some sort of decryption for each row in your DB, this won't be efficient at all. In other words, supply your query with input that directly matches the data stored for best performance.

Getting an Unique Identifier without Inserting

I'm looking for the best way to get a unique number (guaranteed unique, not some random string or the current time in milliseconds, and of a reasonable length, about 8 characters) using MySQL (suggestions for other ways are welcome).
I basically just want to run some SELECT ... statement and have it always return a unique number without inserting anything into the database. Just something that increments some stored value and returns it, and that can handle a lot of concurrent requests without heavy blocking of the application.
I know that I can make something with combinations of random numbers with higher bases (for shorter length), that could make it very unlikely that they overlap, but won't guarantee it.
It just feels like there should be some easy way to get this.
To clarify...
I need this number to be short as it will be part of a URL, and it is OK for the query to lock a row for a short period of time. What I was looking for is maybe some command that under the hood does something like this ...
LOCK VALUE
RETURN VALUE++
UNLOCK VALUE
Where the VALUE is stored in the database, a MySQL database maybe.
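For what it's worth, the lock / increment / unlock idea can be sketched with a one-row counter table. This is a Python/sqlite3 stand-in, not MySQL, and the table name counter, the key 'url_id' and the helper names are assumptions made for the example. In MySQL itself the manual describes a similar pattern using UPDATE ... SET id = LAST_INSERT_ID(id + 1) followed by SELECT LAST_INSERT_ID().

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are exactly the ones written out by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE counter (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.execute("INSERT INTO counter (name, value) VALUES ('url_id', 0)")


def next_id(name: str = "url_id") -> int:
    """Increment the stored value and return it inside one short transaction."""
    conn.execute("BEGIN IMMEDIATE")  # LOCK VALUE (takes the write lock up front)
    try:
        conn.execute("UPDATE counter SET value = value + 1 WHERE name = ?", (name,))
        value = conn.execute(
            "SELECT value FROM counter WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute("COMMIT")  # UNLOCK VALUE
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return value


ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number: int) -> str:
    """Shorten a non-negative integer for use in a URL."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


first = next_id()
second = next_id()
```

This updates a row rather than inserting one, and the lock is held only for the duration of the two statements.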
You seek UUID().
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
mysql> SELECT UUID();
-> '6ccd780c-baba-1026-9564-0040f4311e29'
It will return a 128-bit hexadecimal number. You can munge as necessary.
Is the unique number to be associated with a particular row in a table? If not, why not call rand(): select rand(); The value returned is between zero and one, so scale as desired.
Great question.
Shortest answer - that is simply not possible according to your specifications.
Long answer - the closest approach to this is MySQL's UUID, but that is neither short nor sortable (i.e. a later UUID value is not guaranteed to be greater or smaller than an earlier one).
To UUID or not to UUID? is a nice article describing the pros and cons of their usage, and it also touches on some of the reasons why you can't have what you need.
I am not sure I understand exactly, maybe something like this:
SELECT ROUND(RAND() * 123456789) as id
The larger you make the number, the larger your id.
No guarantees about uniqueness of course, this is a quick hack after all and you should implement a check in code to handle the off chance a duplicate is inserted, but maybe this would serve your purpose?
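A rough sketch of that duplicate check, letting a unique key reject collisions and retrying. It uses Python with sqlite3 instead of MySQL, and the table short_ids and the function reserve_random_id are hypothetical names:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key is what actually guarantees uniqueness.
conn.execute("CREATE TABLE short_ids (id INTEGER PRIMARY KEY)")


def reserve_random_id(upper: int = 123456789, max_attempts: int = 10) -> int:
    """Pick a random id below `upper` and retry if it is already taken."""
    for _ in range(max_attempts):
        candidate = random.randrange(upper)
        try:
            conn.execute("INSERT INTO short_ids (id) VALUES (?)", (candidate,))
            return candidate
        except sqlite3.IntegrityError:
            # Duplicate: the database refused it, so try another number.
            continue
    raise RuntimeError("could not find a free id")


new_id = reserve_random_id()
```

Note that this does insert a row, so it only fits if reserving the id in a table is acceptable.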
Of course, there are many other approaches possible to do this.
You can easily use most any scripting language to generate this for you, php example here:
//Generates a 32 character identifier that is extremely difficult to predict.
$id = md5(uniqid(rand(), true));
Then use $id in your query or whatever you need your unique id in. In my opinion, the advantage of doing this in a scripting language when interacting with a DB is that it is easier to validate for application / usage purposes and act accordingly. For instance, in your example, whatever method you use, if you wanted to be 100% always sure of data integrity, you have to make sure there are no duplicates of that id elsewhere. This is easier to do in a script than in SQL.
Hope that helps my friend, good-luck!

do I need to store domain names to md5 mode in database?

I have a feeling that searching domain names is taking more time than usual in MySQL. The domain name column actually has a unique index, yet the query seems slow.
My question is: do I need to convert the values to a binary form, say an MD5 hash or something?
Normally keeping the domain names in a "VARCHAR" data type, with an UNIQUE index defined for that field, is the most simple & efficient way of managing your data.
Never try to use any complexity (like using Binary mode or "BLOB" data type), for the sake of one / two field(s), as it will further deteriorate your MySQL performance.
Hope it helps.

MySql Performance Question: MD5(value)

for security purpose I do some queries in this way:
SELECT avatar_data FROM users WHERE MD5(ID) ='md5value'
So, for example I have this entries:
-TABLE.users-
ID | avatar_data
39 | some-data
I do this query:
SELECT avatar_data FROM users WHERE MD5(ID) ='d67d8ab4f4c10bf22aa353e27879133c'
'd67d8ab4f4c10bf22aa353e27879133c' is the '39' value filtered by MD5.
I have a VERY large database with a lot of entries. I wonder if this approach might compromise the DB performance?
Because you are using a function on the column you want to search ( MD5(ID)= ), MySQL will have to do a full table scan.
While I am not sure your reason for doing a search like that, to speed things up, I can suggest you add another column with the processed ID data and index it.
So you should do:
SELECT * FROM users WHERE MD5_ID = 'd67d8ab4f4c10bf22aa353e27879133c'
With that query and without functional indexes, yes you would table-scan the whole thing. If you do that often, you may want to pre-compute the digest into a surrogate table or in another column, index and lookup directly.
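You can see the difference in a query plan. The sketch below uses Python's sqlite3 as a stand-in, so the plan text is SQLite's and not what MySQL's EXPLAIN prints, and the md5_id column and index name are assumptions. SQLite has no built-in MD5(), so one is registered from Python just for the demo.

```python
import hashlib
import sqlite3


def md5_hex(value) -> str:
    """MD5 of the value's string form, like MD5(ID) on an integer column."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("MD5", 1, md5_hex)  # demo-only SQL function
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, avatar_data TEXT, md5_id TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, avatar_data, md5_id) VALUES (?, ?, ?)",
    [(i, f"data-{i}", md5_hex(i)) for i in range(1, 101)],
)
conn.execute("CREATE INDEX idx_users_md5_id ON users (md5_id)")


def plan(sql: str, params: tuple) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


target = md5_hex(39)

# Function applied to the column: every row has to be hashed and compared.
scan_plan = plan("SELECT avatar_data FROM users WHERE MD5(id) = ?", (target,))

# Precomputed, indexed column: a direct index lookup.
index_plan = plan("SELECT avatar_data FROM users WHERE md5_id = ?", (target,))

print(scan_plan)
print(index_plan)
```

The first plan reports a scan of the table, the second a search using the index on the precomputed column.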
Yes that would probably get very slow and it really doesn't add any security. MD5 of '39' is pretty easy to figure out. For a one way hash to be successful it needs to contain values that would be unknown to an attacker. Otherwise the attacker is just going to hash the value and you've not really accomplished anything.
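To illustrate how little the hash hides when the input is a small integer, here is a short Python sketch that simply tries every candidate ID. The function name and the upper bound are arbitrary choices for the example:

```python
import hashlib


def recover_numeric_id(target_hash: str, max_id: int = 1_000_000):
    """Hash 0..max_id in turn and return the id whose MD5 matches, else None."""
    for candidate in range(max_id + 1):
        digest = hashlib.md5(str(candidate).encode("ascii")).hexdigest()
        if digest == target_hash:
            return candidate
    return None


# The hash from the question.
recovered = recover_numeric_id("d67d8ab4f4c10bf22aa353e27879133c")
print(recovered)
```

Anyone who knows (or guesses) that the values are sequential IDs can do the same, which is why the hash adds nothing here.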
You might consider posting more about what you're doing. For example: is this a web administration tool? Is it password protected? Etc.
If you want this kind of security, you would probably be better off saving the passwords as an MD5 hash. Encoding IDs doesn't really give you any security.