Indexes on BLOBs that contain encrypted data - mysql

I have a bunch of columns in a table that are of the type BLOB. The data that's contained in these columns are encrypted with MySQL's AES_ENCRYPT() function. Some of these fields are being used in a search section of an application I'm building. Is it worth it to put indexes on the columns that are being frequently accessed? I wasn't sure if the fact that they are BLOBs or the fact that the data itself is encrypted would make an index useless.
EDIT: Here are some more details about my specific case. There is a table with ~10 columns or so that are each BLOBs. Each record that is inserted into this table is encrypted using the AES_ENCRYPT() function. In the search portion of my application, users can type in a query. I take the stored values and decrypt them like this: SELECT AES_DECRYPT(fname, MYSTATICKEY) AS fname FROM some_table so that I can perform a search using a LIKE clause. What I am curious about is whether the index will index the encrypted data rather than the actual data returned from the decryption. I am guessing that if the index applies only to the encrypted binary string then it won't help performance at all. Am I wrong about that?

Note the following:
You can't add an index of type FULLTEXT to a BLOB column (http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html)
Therefore, you will need to use another type of index. For BLOBs, you will have to specify a prefix length (http://dev.mysql.com/doc/refman/5.0/en/create-index.html) - the length will depend on the storage engine (e.g. up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables). Therefore, unless the values you are storing are short you won't be able to index all the data.
AES_ENCRYPT() encrypts a string and returns a binary string. This binary string will be the value that is indexed.
Therefore, IMO, your guess is right - an index won't help the performance of your searches.
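To illustrate (a sketch with hypothetical table and key names): the index is built over the bytes AES_ENCRYPT() returns, so any search that must decrypt each row cannot use it.

```sql
-- Hypothetical table: the prefix index covers the *ciphertext* bytes.
CREATE TABLE some_table (
  id INT PRIMARY KEY,
  fname BLOB,
  INDEX idx_fname (fname(100))   -- indexes encrypted bytes, not plaintext
);

INSERT INTO some_table (id, fname)
VALUES (1, AES_ENCRYPT('alice', 'MYSTATICKEY'));

-- This must decrypt every row before applying LIKE, so it scans the table:
SELECT id FROM some_table
WHERE AES_DECRYPT(fname, 'MYSTATICKEY') LIKE '%ali%';

-- Only an exact match on the ciphertext itself can use the index
-- (this works because AES_ENCRYPT with a fixed key is deterministic):
SELECT id FROM some_table
WHERE fname = AES_ENCRYPT('alice', 'MYSTATICKEY');
```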
Note that 'indexing an encrypted column' is a fairly common problem - there are quite a few articles online about it. For example (although this one is quite old and written for MS SQL Server, it covers some relevant ideas): http://blogs.msdn.com/b/raulga/archive/2006/03/11/549754.aspx
Also see: What's the best way to store and yet still index encrypted customer data? (the top answer links to the same article I found above)

Related

MySQL using MATCH AGAINST for long unique values (8.0.27)

I have a situation where we're storing long unique IDs (up to 200 characters) that are single TEXT entries in our database. The problem is we're using a FULLTEXT index for speed purposes, and it works great for the smaller GUID-style entries. The problem is it won't work for entries > 84 characters due to the limitation of innodb_ft_max_token_size, which apparently cannot be set > 84. This means any entries longer than 84 characters are omitted from the index.
Sample Entries (actual data from different sources I need to match):
AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==
AND
<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=#t2.msgid.quoramail.com>
AND
["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]
So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?
What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?
Here's a summary of the discussion in comments:
The id's have multiple formats, either a single token of variable length up to 200 characters, or even an "array," being a JSON-formatted document with multiple tokens. These entries come from different sources, and the format is outside of your control.
The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters. Tokens longer than that can be neither indexed nor searched.
You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support cases of JSON arrays of multiple tokens. You can't use a B-tree index to search for words in the middle of a string. Nor can you use an index with the LIKE predicate to match a substring using a wildcard in the front of the pattern.
Therefore to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split this into individual tokens and store each one on a row by itself. This means writing some code to transform the content you receive as id's before inserting them into the database.
MySQL 8.0.17 supports a new kind of index on a JSON array, called a Multi-Value Index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index. But this also would require writing some code to transform the singular form of id's into a JSON array.
The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.
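A sketch of the multi-valued index option from the summary above (MySQL 8.0.17+; table and column names are made up, and charset/collation caveats for CHAR ARRAY casts should be checked against your version):

```sql
-- Store every id, even singletons, as a JSON array of strings.
CREATE TABLE entries (
  id INT AUTO_INCREMENT PRIMARY KEY,
  tokens JSON NOT NULL,
  INDEX idx_tokens ((CAST(tokens AS CHAR(200) ARRAY)))
);

INSERT INTO entries (tokens)
VALUES ('["ca97826d-3bea-4986-b112-782ab312aq23",
          "ca97826d-3bea-4986-b112-782ab312aaf7"]');

-- MEMBER OF can use the multi-valued index:
SELECT id FROM entries
WHERE 'ca97826d-3bea-4986-b112-782ab312aaf7' MEMBER OF (tokens);
```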
Create a new table with 2 columns: a VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin column for the token (Base64 needs case sensitivity), plus a reference to the corresponding row in your main table.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in your table to add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also do the INSERT of new rows for new data.
Now a simple BTree lookup plus a Join will satisfy all your lookups.
TEXT cannot be indexed without a prefix length, but VARCHAR up to some limit can be indexed in full. 200 characters of ascii is only 200 bytes, well below the 3072-byte limit.
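That side table and lookup might be sketched like this (table and column names are hypothetical):

```sql
-- One token per row; ascii_bin keeps Base64 tokens case-sensitive.
CREATE TABLE token_lookup (
  token   VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  main_id INT NOT NULL,
  PRIMARY KEY (token),
  INDEX (main_id)
) ENGINE=InnoDB;

-- At insert time, split any JSON arrays and add one row per token:
INSERT INTO token_lookup (token, main_id)
VALUES ('example-token-1', 42);

-- Point lookup on the PK, then join back to the main table:
SELECT m.*
FROM token_lookup t
JOIN main_table m ON m.id = t.main_id
WHERE t.token = 'example-token-1';
```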

Is it correct to have a BLOB field directly in the main table?

Which one is better: having a BLOB field in the same table or having a 1-TO-1 reference to it in another table?
I'm making a MySQL database whose main table is called item(ID, Description). This table is consulted by a program I'm developing in VB.NET, which offers the possibility to double-click a specific item obtained with a query. Once its dedicated form is opened, I would like to show an image stored in a BLOB field, a sort of item preview. The problem is I don't know where it is better to create this BLOB field.
Assuming to have a table like this: Item(ID, Description, BLOB), will the BLOB field affect the database performance on queries like:
SELECT ID, Description FROM Item;
If yes, what do you think about this solution:
Item(ID, Description)
Images(Item, File)
Where Images.Item references to Item.ID, and File is the BLOB field.
You can add the BLOB field directly to your main table, since BLOB fields are not stored in-row anyway and require a separate look-up to retrieve their contents. Your dependent table is unnecessary.
BUT another, and preferred, way is to store in your database table only a pointer (the path to the file on the server) to your image file. This way you can retrieve the path and access the file from your VB.NET application.
To quote the documentation about blobs:
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
In simpler terms, the blob's storage isn't inside the table's row; only a pointer is - which is pretty similar to what you're trying to achieve with the secondary table. To make a long story short: there's no need for another table, MySQL already does the same thing internally.
Most of what has been said in the other Answers is mostly correct. I'll start from scratch, adding some caveats.
The two-table, 1-1, design is usually better for MyISAM, but not for InnoDB. The rest of my Answer applies only to InnoDB.
"Off-record" storage may happen to BLOB, TEXT, and 'large' VARCHAR and VARBINARY, almost equally.
"Large" columns are usually stored "off-record", thereby providing something very similar to your 1-1 design. However, having InnoDB do the work usually leads to better performance.
The ROW_FORMAT and the size of the column makes a difference.
A "small" BLOB may be stored on-record. Pro: no need for the extra fetch when you include the blob in the SELECT list. Con: clutter.
Some ROW_FORMATs cut off at 767 bytes.
Some ROW_FORMATs store 20 bytes on-record; this is just a 'pointer'; the entire blob is off-record.
etc, etc.
Off-record is beneficial when you need to filter out a bunch of rows, then fetch only a few. Also, when you don't need the column.
As a side note, TINYTEXT is possibly useless. There are situations where the 'equivalent' VARCHAR(255) performs better.
Storing an image in the table (on- or off-record) is arguably unwise if that image will be used in an HTML page. HTML is quite happy to request the <img src=...> from your server or even some other server. In this case, a smallish VARCHAR containing a url is the 'correct' design.
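The url-based design suggested above might look like this (a sketch; table and column names are hypothetical):

```sql
CREATE TABLE Item (
  ID          INT PRIMARY KEY,
  Description VARCHAR(255),
  ImageUrl    VARCHAR(255)   -- e.g. 'https://cdn.example.com/items/42.jpg'
);
-- The application (or an HTML page) fetches the image from that url;
-- the image bytes never pass through the database at all.
```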

MySQL DB normalization

I've got a single table DB with 100K rows. There are about 30 columns and 28 of them are varchars / tiny text and one of them is an int primary key and one of them is a blob.
My question, is in terms of performance, would it be better to separate the blob from the rest of the table and store them in their own table with foreign key constraint to the primary id?
The table will eventually be turned into a sqlite persistent store for iOS core data and a lot of the searching / filtering will be done based on the NSPredicate for the lighter varchar columns.
Sorry if this is too subjective, but I'm thinking there is a recommended way.
Thanks!
If you do SELECT * FROM table (which you shouldn't do if you don't actually need the BLOB field), then yes, the query will be faster with the BLOB in a separate table, because pages containing BLOBs won't be touched.
If you do frequent SELECT f1, f2, f3 FROM table (all fields are non-BLOBs) then yes, storing BLOBS in a separate table will make the query faster because of the same reason - MySQL will have to read less pages.
If however the BLOB is selected frequently then it makes no sense to keep it separately.
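The split being discussed might be sketched like this (hypothetical names):

```sql
-- Light table: the columns you search and filter on.
CREATE TABLE item (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100)
) ENGINE=InnoDB;

-- Heavy table: the blob, 1-to-1 with item via a foreign key.
CREATE TABLE item_blob (
  item_id INT PRIMARY KEY,
  payload BLOB,
  FOREIGN KEY (item_id) REFERENCES item(id)
) ENGINE=InnoDB;

-- Frequent queries touch only the light table:
SELECT id, name FROM item WHERE name LIKE 'foo%';

-- The blob is joined in only when actually needed:
SELECT i.name, b.payload
FROM item i JOIN item_blob b ON b.item_id = i.id
WHERE i.id = 42;
```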
This totally depends on data usage.
If you need the blob data every time you query the table, there is no difference in having a separate table for it (as long as the blob data is unique in each row - that is, as long as the database is normalized).
If you don't need the blob data but only the metadata from the other columns, there may be a speed bonus when querying if the blob has its own table. Querying the blob data is slower, though, as you then need to query both tables.
The USUAL way is not to store any blob data inside the database (at least not huge data), but to store the binary data in files and keep the file path in the database instead. This is recommended because binary data rarely benefits from being inside a DBMS (it is not indexable, sortable, or groupable), so there is no drawback to storing it in files, while the database isn't optimized for binary data (because, again, it can't do much with it anyway).
In MySQL, blobs are stored on disk; only a pointer to that storage is kept in the row. Moving the blob to another table with a foreign key will not noticeably help your performance. I don't know if this is the case for SQLite.

How to choose optimized datatypes for columns [innodb specific]?

I'm learning about the usage of datatypes for databases.
For example:
Which is better for email? varchar[100], char[100], or tinyint (joking)
Which is better for username? should I use int, bigint, or varchar?
Explain. Some of my friends say that if we use int, bigint, or another numeric datatype it will be better (Facebook does it). Like u=123400023 refers to user 123400023, rather than user=thenameoftheuser, since numbers take less time to fetch.
Which is better for phone numbers? Posts (like in blogs or announcements)? Or maybe dates (I use DATETIME for that)? Maybe some of you have done research you'd like to share.
Product price (I use decimal(11,2), don't know about you guys)?
Or anything else that you have in mind, like, "I use serial datatype for blablabla".
Why do I mention innodb specifically?
Unless you are using the InnoDB table types (see Chapter 11, "Advanced MySQL," for more information), CHAR columns are faster to access than VARCHAR.
InnoDB has some differences that I don't know about.
I read that from here.
Brief Summary:
(just my opinions)
for email address - VARCHAR(255)
for username - VARCHAR(100) or VARCHAR(255)
for id_username - use INT (unless you plan on over 2 billion users in your system)
phone numbers - INT or VARCHAR or maybe CHAR (depends on if you want to store formatting)
posts - TEXT
dates - DATE or DATETIME (definitely include times for things like posts or emails)
money - DECIMAL(11,2)
misc - see below
As far as using InnoDB because VARCHAR is supposed to be faster, I wouldn't worry about that, or speed in general. Use InnoDB because you need to do transactions and/or you want to use foreign key constraints (FK) for data integrity. Also, InnoDB uses row level locking whereas MyISAM only uses table level locking. Therefore, InnoDB can handle higher levels of concurrency better than MyISAM. Use MyISAM to use full-text indexes and for somewhat less overhead.
More importantly for speed than the engine type: put indexes on the columns that you need to search on quickly. Always put indexes on your ID/PK columns, such as the id_username that I mentioned.
More details:
Here's a bunch of questions about MySQL datatypes and database design (warning, more than you asked for):
What DataType should I pick?
Table design question
Enum datatype versus table of data in MySQL?
mysql datatype for telephne number and address
Best mysql datatype for grams, milligrams, micrograms and kilojoule
MySQL 5-star rating datatype?
And a couple questions on when to use the InnoDB engine:
MyISAM versus InnoDB
When should you choose to use InnoDB in MySQL?
I just use tinyint for almost everything (seriously).
Edit - How to store "posts:"
Below are some links with more details, but here's the short version. For storing "posts," you need room for a long text string. CHAR max length is 255, so that's not an option, and of course CHAR would waste unused characters versus VARCHAR, which is a variable-length CHAR.
Prior to MySQL 5.0.3, VARCHAR max length was 255, so you'd be left with TEXT. However, in newer versions of MySQL, you can use VARCHAR or TEXT. The choice comes down to preference, but there are a couple of differences. VARCHAR and TEXT max lengths are now both 65,535, but you can set your own max on VARCHAR. Let's say you think your posts will only need to be 2000 characters max: you can set VARCHAR(2000). If you ever run into the limit, you can ALTER your table later and bump it to VARCHAR(3000). On the other hand, TEXT actually stores its data in a BLOB (1). I've heard that there may be performance differences between VARCHAR and TEXT, but I haven't seen any proof, so you may want to look into that more; in any case, you can always change this minor detail in the future.
More importantly, searching this "post" column using a Full-Text Index instead of LIKE would be much faster (2). However, you have to use the MyISAM engine to use full-text index because InnoDB doesn't support it. In a MySQL database, you can have a heterogeneous mix of engines for each table, so you would just need to make your "posts" table use MyISAM. However, if you absolutely need "posts" to use InnoDB (for transactions), then set up a trigger to update the MyISAM copy of your "posts" table and use the MyISAM copy for all your full-text searches.
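The trigger idea above might be sketched like this (hypothetical names; in practice you'd also need UPDATE and DELETE triggers, and note that InnoDB itself supports FULLTEXT indexes from MySQL 5.6 on, which makes the shadow table unnecessary there):

```sql
-- InnoDB master table for transactions:
CREATE TABLE posts (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB;

-- MyISAM shadow copy carrying the FULLTEXT index:
CREATE TABLE posts_search (
  id   INT PRIMARY KEY,
  body TEXT,
  FULLTEXT (body)
) ENGINE=MyISAM;

-- Keep the shadow copy in sync on insert:
CREATE TRIGGER posts_ai AFTER INSERT ON posts
FOR EACH ROW
  INSERT INTO posts_search (id, body) VALUES (NEW.id, NEW.body);

-- Full-text searches then go against the shadow table:
SELECT id FROM posts_search
WHERE MATCH(body) AGAINST('some words');
```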
See bottom for some useful quotes.
MySQL Data Type Chart (outdated)
MySQL Datatypes (outdated)
Chapter 10. Data Types (better details)
The BLOB and TEXT Types (1)
11.9. Full-Text Search Functions (2)
10.4.1. The CHAR and VARCHAR Types (3)
(3) "Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 255 before MySQL 5.0.3, and 0 to 65,535 in 5.0.3 and later versions. Before MySQL 5.0.3, if you need a data type for which trailing spaces are not removed, consider using a BLOB or TEXT type.
When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed.
Before MySQL 5.0.3, trailing spaces are removed from values when they are stored into a VARCHAR column; this means that the spaces also are absent from retrieved values."
Lastly, here's a great post about the pros and cons of VARCHAR versus TEXT. It also speaks to the performance issue:
VARCHAR(n) Considered Harmful
There are multiple angles to approach your question.
From a design POV it is always best to choose the datatype which best expresses the quantity you want to model. That is, get the data domain and data size right so that illegal data cannot be stored in the database in the first place. But that is not one of MySQL's strengths, especially not with the default sql_mode (http://dev.mysql.com/doc/refman/5.1/en/server-sql-mode.html). If it works for you, try the TRADITIONAL sql_mode, which is a shorthand for many desirable flags.
From a performance POV, the question is entirely different. For example, regarding the storage of email bodies, you might want to read http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/ and then think about that.
Removing redundancies and having short keys can be a big win. For example, in a project that I have seen, a log table has been storing http User-Agent information. By simply replacing each user agent string in the log table with a numeric id of a user agent string in a lookup table, data set size was considerably (more than 60%) reduced. By parsing the user agent further and then storing a bunch of ids (operating system, browser type, version index) data set size was reduced to 1% of the original size.
Finally, there is a number of rules that can help you spot errors in schema design.
For example, anything that has id in the name and is not an unsigned integer type is probably a bug (especially in the context of innodb).
For example, anything that has price or cost in the name and is not unsigned is a potential source of fraud (fraudster creates article with negative price, and buys that).
For example, anything that works on monetary data and is not using the DECIMAL data type of the appropriate size is probably doing math wrong (DECIMAL is doing BCD, decimal paper math with correct precision and rounding, DOUBLE and FLOAT do not).
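The DECIMAL-vs-FLOAT point can be seen directly in a couple of queries (a sketch; exact display may vary by version):

```sql
-- Exact decimal arithmetic: DECIMAL adds with correct precision.
SELECT CAST(0.1 AS DECIMAL(10,2)) + CAST(0.2 AS DECIMAL(10,2));
-- yields exactly 0.30

-- Binary floating point carries IEEE 754 representation error:
SELECT 0.1E0 + 0.2E0;
-- not exactly 0.3 (something like 0.30000000000000004)
```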
SQLyog has a "Calculate optimal datatype" feature which helps find the optimal datatype based on the records inserted in a table.
It uses the query
SELECT * FROM `table_name` PROCEDURE ANALYSE(1, 10);
to find the optimal datatype.

Do I need to store domain names as MD5 hashes in the database?

I have a feeling that searching domain names is taking more time than usual in MySQL. The domain name column actually has a unique index, yet the query seems slow.
My question is: do I need to convert it to a binary form, say an MD5 hash or something?
Normally, keeping the domain names in a VARCHAR column with a UNIQUE index defined on that field is the simplest and most efficient way of managing your data.
Don't add complexity (like binary mode or a BLOB data type) for the sake of one or two fields, as it will only further deteriorate your MySQL performance.
Hope it helps.
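That straightforward design might look like this (hypothetical names):

```sql
CREATE TABLE domains (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  domain VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_domain (domain)
) ENGINE=InnoDB;

-- An exact-match lookup like this is a single B-tree probe:
SELECT id FROM domains WHERE domain = 'example.com';

-- If a query like this is slow, check EXPLAIN to confirm the index
-- is being used before reaching for MD5 hashes.
```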