MySQL - Working with Encrypted Columns

I have some tables with encrypted fields. After looking through the MySQL docs, I found that you can't create a custom datatype for encryption / decryption, which would be ideal. So, instead, I have a view similar to the one below:
CREATE VIEW EMPLOYEE AS
SELECT ID, FIRST_NAME, LAST_NAME, SUPER_SECURE_DECRYPT(SSN) AS SSN
FROM EMPLOYEE_ENCRYPTED
Again, after reading through the MySQL documentation, I've learned that the view isn't insertable because it has a derived column and the SSN field isn't updatable since it's a derived column, which makes sense. However, you can't add a trigger to a view so writing to the view is not really an option.
In an attempt to get around this, I've created a couple of triggers similar to:
CREATE TRIGGER EMPLOYEE_ENCRYPTED_UPDATE
BEFORE UPDATE ON EMPLOYEE_ENCRYPTED FOR EACH ROW
BEGIN
IF NEW.SSN <> OLD.SSN THEN
SET NEW.SSN = SUPER_SECURE_ENCRYPT(NEW.SSN);
END IF;
END;
as well as one for inserting (which is so similar that I won't post it). This simply means I have to read from the view and write to the table.
This is a decent solution, except when you supply a WHERE clause for the UPDATE statement that queries the encrypted column (as in, updating an employee by their SSN). Typically, this isn't an issue since I normally use the primary key for updates, but I need to know for other encrypted fields if there's a way to do this.
I want to make MySQL do the heavy lifting for encryption and decryption so that it can be as frictionless as possible to work with as a developer. I would like the application developer to not have to worry about encrypted fields as much as possible while still using encrypted fields, that's the ultimate goal here. Any help or advice is appreciated.

It is difficult to answer your question without knowing the type of encryption you are using. If it's a standard function such as MD5, you can use it directly in MySQL with a WHERE ssn=MD5('ssnStr') type of clause, though MD5 is a one-way hash and can't be decrypted. Otherwise, if it's some sort of customized encryption, you'll have to either
1) create a stored function that performs the encryption/decryption and use that in your WHERE clause
or
2) perform the encryption beforehand and use its result to match the condition you desire in your WHERE clause or wherever in your query.
It may be best to supply your query with the encrypted value for the SSN and use that to match to your field. If you have to perform some sort of decryption for each row in your DB, this won't be efficient at all. In other words, supply your query with input that directly matches the data stored for best performance.
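As a rough illustration of option 2, here is a minimal Python sketch that computes the lookup value before the query runs, so the WHERE clause compares stored data directly. Everything specific in it is an assumption and not from the question: the SSN_TOKEN column, the LOOKUP_KEY constant, the use of a keyed HMAC-SHA256 digest in place of SUPER_SECURE_ENCRYPT, and sqlite3 standing in for a MySQL connection.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical key; in practice it would come from configuration, not source code.
LOOKUP_KEY = b"example-key-not-for-production"


def ssn_token(ssn: str) -> str:
    """Deterministic keyed digest of an SSN, computed before the query runs."""
    return hmac.new(LOOKUP_KEY, ssn.encode("utf-8"), hashlib.sha256).hexdigest()


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute(
    "CREATE TABLE EMPLOYEE_ENCRYPTED ("
    "ID INTEGER PRIMARY KEY, LAST_NAME TEXT, SSN_TOKEN TEXT)"
)
# An index on the token column lets the lookup avoid a per-row computation.
conn.execute("CREATE INDEX IDX_SSN_TOKEN ON EMPLOYEE_ENCRYPTED (SSN_TOKEN)")
conn.execute(
    "INSERT INTO EMPLOYEE_ENCRYPTED (LAST_NAME, SSN_TOKEN) VALUES (?, ?)",
    ("Smith", ssn_token("123-45-6789")),
)

# The WHERE clause compares the stored value with a value computed up front,
# so nothing has to be decrypted row by row.
cursor = conn.execute(
    "UPDATE EMPLOYEE_ENCRYPTED SET LAST_NAME = ? WHERE SSN_TOKEN = ?",
    ("Jones", ssn_token("123-45-6789")),
)
updated_rows = cursor.rowcount
```

This only works when the same input always produces the same stored value, which is why a deterministic digest is used here; it supports equality lookups only, not range or LIKE searches.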

Related

Batch Set all MySQL columns to all NULL

I have a large database with a bunch of tables, and the columns are mixed: some allow NULL while others don't.
I just recently decided to standardize my methods and use NULL for all empty fields, so I need to set all columns in all my tables to allow NULL (except for primary keys, of course).
I can whip up some PHP code to loop through this, but I was wondering if there's a quick way to do it via SQL?
regards
You can use metadata from the system tables to determine your tables, columns, types etc. Then, using that, dynamically build a string that contains your ALTER TABLE SQL, with table and column names concatenated into it. This is then executed.
I've recently posted a solution that allowed the OP to search through columns looking for those that contain a particular value. In lieu of anyone providing a more complete answer, this should give you some clues about how to approach this (or at least, what to research). You'd need to either provide table names, or join to them, and then do something similar to this, except you'd be checking type, not value (and the dynamic SQL you build would be an ALTER TABLE statement, not a SELECT).
I will be in a position to help you with your specific scenario further in a few hours... If by then you've had no luck with this (or other answers) then I'll provide something more complete then.
EDIT: Just realised you've tagged this as MySQL... My solution was for MS SQL Server. The principles should be the same (and hence I'll leave this answer up, as I think you'll find it useful), assuming MySQL allows you to query its metadata and execute dynamically generated SQL commands.
SQL Server - Select columns that meet certain conditions?
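For the MySQL case, a minimal sketch of the metadata approach might look like the Python below. It assumes the column metadata has already been fetched from information_schema.COLUMNS (the query text is included for reference), and the nullable_statements helper and the sample rows are made up for illustration. Note that MODIFY restates the whole column definition, so defaults, comments and other attributes would need to be carried over as well; this sketch ignores them.

```python
# Query that would be run against MySQL to fetch the column metadata.
# %s is the placeholder style used by common MySQL drivers (an assumption here).
METADATA_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, IS_NULLABLE, COLUMN_KEY
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s
"""


def nullable_statements(rows):
    """Build one ALTER TABLE per NOT NULL column that is not part of the primary key."""
    statements = []
    for table, column, column_type, is_nullable, column_key in rows:
        if is_nullable == "YES" or column_key == "PRI":
            continue  # already nullable, or a primary key column
        statements.append(
            f"ALTER TABLE `{table}` MODIFY `{column}` {column_type} NULL"
        )
    return statements


# Made-up rows in the same shape the metadata query returns.
sample_rows = [
    ("users", "id", "int(11)", "NO", "PRI"),
    ("users", "email", "varchar(255)", "NO", "UNI"),
    ("users", "nickname", "varchar(50)", "YES", ""),
]
statements = nullable_statements(sample_rows)
```

Reviewing the generated statements before executing them is probably wise, since they change every table in the schema.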

Dedicated SQL table containing only unique strings

I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but other things potentially too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!
I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not minimal. As the database grows, there are tens of millions of URIs. So, just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in Hadoop, so the database table was, in essence, just a copy of a Hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. And, an index solution would work well up to your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know if MySQL or SQL Server recognize this, although columnar databases (such as Vertica) should.
SQL Server has another option. If you declare the string as VARCHAR(max), then it is stored on a separate data page from the rest of the data. During a full table scan, there is no need to load the additional page in memory, if the column is not being referenced in the query.
This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table that it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table. You can include more than just a string on the lookup table, other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
Date
Time
IP Address
Browser_ID
Browser Table:
Browser_ID
Browser_Name
Browser_Version
Browser_Properties
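To show what "just a simple join" means for the two tables above, here is a small Python sketch. It uses sqlite3 as a stand-in for MySQL, and the exact column names, types and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.executescript("""
CREATE TABLE browser (
    browser_id      INTEGER PRIMARY KEY,
    browser_name    TEXT NOT NULL,
    browser_version TEXT NOT NULL,
    UNIQUE (browser_name, browser_version)
);
CREATE TABLE request (
    request_id   INTEGER PRIMARY KEY,
    requested_at TEXT NOT NULL,
    ip_address   TEXT NOT NULL,
    browser_id   INTEGER NOT NULL REFERENCES browser (browser_id)
);
INSERT INTO browser (browser_id, browser_name, browser_version)
VALUES (1, 'Firefox', '3.6');
INSERT INTO request (requested_at, ip_address, browser_id) VALUES
    ('2010-01-01 10:00:00', '10.0.0.1', 1),
    ('2010-01-01 10:00:05', '10.0.0.2', 1);
""")

# One join brings the repeated browser details back next to each request.
rows = conn.execute("""
SELECT r.requested_at, r.ip_address, b.browser_name, b.browser_version
FROM request AS r
JOIN browser AS b ON b.browser_id = r.browser_id
ORDER BY r.request_id
""").fetchall()
```

Both request rows store only the integer browser_id; the browser name and version are stored once.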
If you are planning on logging data in real time (as opposed to a batch job) then you want to ensure your time to write a record to the database is as quick as possible. If you are logging synchronously then obviously the record creation time will directly affect the time it takes for an HTTP request to complete. If this is async then slow record creation times will lead to a bottleneck. However, if this is a batch job then performance will not matter so long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record you really want to flatten out your database structure. Your current query in pseudocode might look like
SELECT @id = id FROM PagesTable
WHERE PageName = @RequestedPageName
IF @id = 0
THEN
INSERT @RequestedPageName INTO PagesTable
@id = SELECT @@IDENTITY 'or whatever method your db supports for
'fetching the id of a newly created record
END IF
INSERT @id, @BrowserName INTO BrowersLogTable
Whereas in a flat structure you would need just one INSERT.
If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use those for querying against.
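For comparison, the look-up-then-insert step from the pseudocode can be sketched in Python against the unique_string table from the question. This is only an illustration under assumptions: sqlite3 stands in for MySQL, string_id is a made-up helper name, and INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE unique_string ("
    "id INTEGER PRIMARY KEY, string TEXT NOT NULL UNIQUE)"
)


def string_id(conn, value):
    """Return the id for value, inserting the string first if it is new."""
    # The UNIQUE index makes the insert a no-op when the string already exists.
    conn.execute(
        "INSERT OR IGNORE INTO unique_string (string) VALUES (?)", (value,)
    )
    row = conn.execute(
        "SELECT id FROM unique_string WHERE string = ?", (value,)
    ).fetchone()
    return row[0]


first = string_id(conn, "Mozilla/5.0")
second = string_id(conn, "Mozilla/5.0")  # same string, same id
other = string_id(conn, "Opera/9.80")
```

Every logged request pays for this extra round trip per string column, which is the write-time cost described above.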

MySQL: SELECTing by hash: is this possible?

I don't think it makes too much sense. Still, this way you could hide the real static value from the .php file while keeping its hash value in the PHP file for the MySQL query. The source of a PHP file can't be reached from the user's machine, but you have to make backups of your files, and that static value is there. Selecting using the hash of a column would resolve this problem, I believe.
But, I didn't find any examples or documentation saying that it's possible to use such functions in queries (not for values in sql queries, but for columns to select).
Is this possible?
It is possible. Here is an extremely slow query that simply selects all rows with an empty "column" (the hash below is the MD5 of an empty string):
SELECT * FROM table WHERE MD5(column) = 'd41d8cd98f00b204e9800998ecf8427e'
If you're doing a lot of these queries, consider saving the MD5 hash in a column or index. Even better would be to do all MD5 calculations on the script's end - the day you're going to need an extra server for your project you'll notice that webservers scale a lot better than database servers. (That's something to worry about in the future, of course)
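A minimal sketch of doing the MD5 calculation on the script's end, written in Python here instead of PHP. The column_md5 column in the query text is hypothetical: it assumes the hash has been saved in its own indexed column as suggested above.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Same 32-character lowercase hex digest that MySQL's MD5() returns."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Hash computed by the script, so the database only does an indexed comparison.
empty_hash = md5_hex("")

# Hypothetical stored-hash column; %s is the placeholder style of common MySQL drivers.
QUERY = "SELECT * FROM `table` WHERE column_md5 = %s"
params = (empty_hash,)
```

The digest of the empty string matches the literal used in the query above.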
It should be noted that setting up your system this way won't actually solve any problem in your particular case. You are not making your system more secure doing this, you are just making it more convoluted.
The standard way to hide secret values from the source base is to store these secret values in a separate file, and never submit that file to source control or make a backup of it. Load the value of the secret by using PHP code and then work with the value directly in MySQL (one way to do this is to store a "config.php" file or something along those lines that just sets variables/constants, and then just PHP-include the file).
That said, I'll answer your question anyway.
MySQL actually has a wide variety of hashing and encryption functions. See http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html
Since you tagged your question md5 I'm assuming the function you're looking for is MD5: http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html#function_md5
You select it just like this:
SELECT MD5(column) AS hashed_column FROM table
Then the value to compare to the hash will be in the column alias 'hashed_column'.
Or to select a particular row based on the hash:
SELECT * FROM table WHERE MD5(column) = 'hashed-value-to-compare'
If I understand correctly, you want to use a hash as a primary key:
INSERT INTO MyTable (pk) VALUES (MD5('plain-value'));
Then you want to retrieve it by hash without knowing what its hash digest is:
SELECT * FROM MyTable WHERE pk = MD5('plain-value');
Somehow this is supposed to provide greater security in case people steal a backup of your database and PHP code? Well, it doesn't. If I know the original plain-value and the method of hashing, I can find the data just as easily as if you didn't hash the value.
I agree with the comment from @scunliffe -- we're not sure exactly what problem you're trying to solve, but it sounds like this method will not solve it.
It's also inefficient to use an MD5 hash digest as a primary key. You have to store it in a CHAR(32), or else UNHEX it and store it in BINARY(16). Regardless, you can't use INT or even BIGINT as the primary key datatype. The key values are more bulky, and therefore make larger indexes.
Also new rows will insert in an arbitrary location in the clustered index. That's more expensive than adding new values to the end of the B-tree, as you would do if you used simple auto-incrementing integers like everyone else.
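The size difference is easy to check. The Python sketch below compares the two ways of storing an MD5 digest with the 4 bytes a MySQL INT takes; the variable names are only for illustration.

```python
import hashlib

digest = hashlib.md5(b"plain-value")

hex_form = digest.hexdigest()  # the form MD5() returns: fits CHAR(32)
raw_form = digest.digest()     # the form UNHEX(MD5(...)) gives: fits BINARY(16)
int_key_bytes = 4              # storage size of a MySQL INT

# Each key value is several times larger than an integer key,
# and every secondary index repeats the primary key value.
hex_ratio = len(hex_form) // int_key_bytes
raw_ratio = len(raw_form) // int_key_bytes
```

So even the compact BINARY(16) form is four times the size of an INT key.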

MySQL: is there something like an internal record identifier for every record in a MySQL table?

I'm building a spreadsheet app using MySQL as storage, I need to identify records that are being updated client-side in order to save the changes.
Is there a way, such as some kind of "internal record identifier" (internal as in used by the database engine itself), to uniquely identify records, so that I'll be able to update the correct one?
Certainly, a SELECT query can be used to identify the record, including all the fields in the table, but obviously that has the downside of returning multiple records in most situations.
IMPORTANT: the spreadsheet app aims to work on ANY table, even tremendously poorly designed ones without any keys, so solutions such as "define a field with a UNIQUE index and work with that" are not an option. Table structure may be extremely variable and must not matter.
Many thanks.
AFAIK no such unique internal identifier (say, a simple row ID) exists.
You may be able to run a SELECT without any sorting and then get the n-th row using a LIMIT. Under what conditions that is reliable and safe to use, a MySQL guru would need to confirm. It probably never is.
Try playing around with phpMyAdmin, the web frontend to MySQL. It is designed to deal with badly designed tables without keys. If I remember correctly, it uses all columns it can get hold of in such cases:
UPDATE xyz SET a = b WHERE `fieldname` = 'value'
AND `fieldname2` = 'value2'
AND `fieldname3` = 'value3'
LIMIT 1;
and so on.
That isn't entirely duplicate-safe either, of course.
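A rough Python sketch of how a spreadsheet app could generate that kind of statement for an arbitrary table. The function names are made up, the %s placeholders assume a typical MySQL driver, and the null-safe <=> operator is used so that columns holding NULL still match.

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_keyless_update(table, original_row, changes):
    """Build an UPDATE that targets a row by matching every original column value."""
    set_clause = ", ".join(f"{quote_identifier(col)} = %s" for col in changes)
    # <=> is MySQL's null-safe equality, so NULL columns also compare equal.
    where_clause = " AND ".join(
        f"{quote_identifier(col)} <=> %s" for col in original_row
    )
    sql = (
        f"UPDATE {quote_identifier(table)} SET {set_clause} "
        f"WHERE {where_clause} LIMIT 1"
    )
    params = list(changes.values()) + list(original_row.values())
    return sql, params


sql, params = build_keyless_update(
    "xyz",
    original_row={"a": "old", "b": None},
    changes={"a": "new"},
)
```

With LIMIT 1, exactly one of several identical rows gets changed; which one is up to the server, so true duplicates remain indistinguishable.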
The only idea that comes to my mind is to add a key column at runtime, and to remove it when your app is done. It's a goose-bump-inducing idea, but maybe better than nothing.
MySQL has "auto-increment" numeric columns that you can add and even define as a primary key, that would give you a unique record id automatically generated by the database. You can query the last record id you just inserted with select LAST_INSERT_ID()
example from mysql's official documentation here
To my knowledge, MySQL lacks an implicit ROWID feature like the one in Oracle (which also exists in other engines with their own syntax). You'll have to create your own AUTO_INCREMENT field.

What is the best method/options for expiring records within a database?

In a lot of databases I seem to be working on these days, I can't just delete a record for any number of reasons, including so that it can still be displayed later on (say, a product that no longer exists) or just to keep a history of what was.
So my question is how best to expire the record.
I have often added a date_expired column, which is a datetime field. Generally I query either where date_expired = 0, or where date_expired = 0 OR date_expired > NOW(), depending on whether the data is going to be expired in the future. Similarly, I have also added a field called expired_flag. When this is set to true/1, the record is considered expired. This is probably the easiest method, although you need to remember to include the expire clause any time you only want the current items.
Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).
Another option, which I haven't seen done nor have I fully attempted myself is to have a table that contains either all of the data from all of the expired records or some form of it--some kind of history table. In this case, retrieval would be even more difficult as you would need to search possibly a massive table and then parse the data.
Are there other solutions or modifications of these that are better?
I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.
I prefer the date expired field method. However, sometimes it is useful to have two dates, both initial date, and date expired. Because if data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.
I like the expired_flag option over the date_expired option, if query speed is important to you.
I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.
I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.
There's a related post here that offers a few other options. Maybe the CDC option?
SQL Server history table - populate through SP or Trigger?
May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Put an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your WHERE clauses.
Some possible enumerated values to use, depending on your needs:
Active
Deleted
Suspended
InUse (Sort of a pseudo-locking mechanism)
Set the column up as a tinyint (that's SQL Server...not sure of the MySQL equivalent). You can also set up a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
I've always used the ValidFrom, ValidTo approach where each table has these two additional fields. If ValidTo Is Null or > Now() then you know you have a valid record. In this way you can also add data to the table before it's live.
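A small Python sketch of the ValidFrom/ValidTo filter, with sqlite3 standing in for MySQL and the current time passed in as a parameter instead of NOW() so the result is repeatable. The product table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.executescript("""
CREATE TABLE product (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    valid_from TEXT NOT NULL,
    valid_to   TEXT
);
INSERT INTO product (name, valid_from, valid_to) VALUES
    ('retired widget',  '2009-01-01 00:00:00', '2009-12-31 23:59:59'),
    ('current widget',  '2009-06-01 00:00:00', NULL),
    ('upcoming widget', '2011-01-01 00:00:00', NULL);
""")


def valid_products(conn, now):
    """Names of rows that have started and have not yet expired at the given time."""
    rows = conn.execute(
        "SELECT name FROM product "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY id",
        (now, now),
    ).fetchall()
    return [name for (name,) in rows]


current = valid_products(conn, "2010-06-15 12:00:00")
```

The upcoming row is already in the table but stays hidden until its valid_from date arrives, which is the "add data before it's live" benefit mentioned above.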
There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).
Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.
None is "best" -- each responds to different requirements.
Here's a tidy summary.
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 4 is a variation on this that moves the history to another table.
Type 2: A new record is added into the customer dimension table. To distinguish the versions, a "valid date range" pair of columns is required. It helps to have a "this record is current" flag.
Type 3: The original record is modified to reflect the change.
In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns. However, it is often used in conjunction with other types.
You can read more about this if you search for "Slowly Changing Dimension".
http://en.wikipedia.org/wiki/Slowly_Changing_Dimension
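As an illustration of Type 2, here is a minimal Python sketch using sqlite3 as a stand-in database. The customer_dim table, its columns and the apply_type2_change helper are assumptions made up for the example, not part of any standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE customer_dim ("
    "row_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, "
    "city TEXT NOT NULL, valid_from TEXT NOT NULL, valid_to TEXT, "
    "is_current INTEGER NOT NULL)"
)


def apply_type2_change(conn, customer_id, new_city, changed_at):
    """Close the current version (if any) and add a new current version."""
    conn.execute(
        "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (changed_at, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_dim "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, changed_at),
    )


apply_type2_change(conn, 42, "Leeds", "2009-01-01")
apply_type2_change(conn, 42, "York", "2010-03-01")

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current "
    "FROM customer_dim ORDER BY row_id"
).fetchall()
```

Nothing is ever deleted: the old row keeps its date range and only loses the "current" flag.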
A very nice approach by Oracle to this problem is partitions. MySQL offers something similar: table partitioning has been available since version 5.1.