MySQL using MATCH AGAINST for long unique values (8.0.27) - mysql

I have a situation where we're storing long unique IDs (up to 200 characters) as single TEXT entries in our database. We're using a FULLTEXT index for speed, and it works great for the smaller GUID-style entries. The problem is that it won't work for entries longer than 84 characters due to the limitation of innodb_ft_max_token_size, which apparently cannot be set above 84. This means any entries of more than 84 characters are omitted from the index.
Sample Entries (actual data from different sources I need to match):
AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==
AND
<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=#t2.msgid.quoramail.com>
AND
["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]
So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?
What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?

Here's a summary of the discussion in comments:
The IDs have multiple formats: either a single token of variable length up to 200 characters, or an "array", i.e. a JSON-formatted document containing multiple tokens. These entries come from different sources, and the format is outside of your control.
The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters, so it cannot index or search for longer tokens.
You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support the cases where a JSON array holds multiple tokens. You can't use a B-tree index to search for words in the middle of a string, nor can you use an index with the LIKE predicate to match a substring when the pattern starts with a wildcard.
Therefore, to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split it into individual tokens and store each one in a row by itself. This means writing some code to transform the content you receive as IDs before inserting it into the database.
MySQL 8.0.17 supports a new kind of index on a JSON array, called a multi-valued index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index. But this also would require writing some code to transform the singular form of IDs into a JSON array (see the sketch after this summary).
The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.
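For illustration, here's a minimal sketch of the multi-valued index option (MySQL 8.0.17+). The table and column names (messages, ext_ids) are assumptions, not from the question, and whether the optimizer actually uses the index for a given query should be verified with EXPLAIN:

-- Store every ID as a JSON array, even when only one token is received.
CREATE TABLE messages (
    id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    ext_ids JSON NOT NULL,
    INDEX idx_ext_ids ( (CAST(ext_ids AS CHAR(200) ARRAY)) )   -- multi-valued index
) ENGINE=InnoDB;

-- A JSON array of tokens is inserted as-is; a singular ID would be wrapped first:
INSERT INTO messages (ext_ids)
VALUES (JSON_ARRAY('ca97826d-3bea-4986-b112-782ab312aq23',
                   'ca97826d-3bea-4986-b112-782ab312aaf7'));

-- Lookups go through MEMBER OF (or JSON_CONTAINS), which can use the multi-valued index:
SELECT id
FROM messages
WHERE 'ca97826d-3bea-4986-b112-782ab312aq23' MEMBER OF (ext_ids);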

Create a new table with 2 columns: a VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin column for the token (Base64 needs case sensitivity), plus the id of the corresponding row in your main table.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in your data and add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also do the INSERT of new rows for new data.
Now a simple BTree lookup plus a join will handle all your lookups.
TEXT columns can't be fully indexed, but VARCHAR up to some limit can. 200 characters of ascii is only 200 bytes, well below the 3072-byte limit.
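A minimal sketch of that lookup table, assuming a main table named main_table with a numeric id primary key (the names id_lookup, main_id and the value 42 are placeholders):

CREATE TABLE id_lookup (
    token   VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    main_id BIGINT UNSIGNED NOT NULL,   -- id of the row in your main table
    PRIMARY KEY (token),
    KEY idx_main (main_id)
) ENGINE=InnoDB;

-- A JSON array of three GUIDs becomes three rows, all pointing at the same main row:
INSERT INTO id_lookup (token, main_id)
VALUES ('ca97826d-3bea-4986-b112-782ab312aq23', 42),
       ('ca97826d-3bea-4986-b112-782ab312aaf7', 42),
       ('ca97826d-3bea-4986-b112-782ab312a326', 42);

-- Point lookup plus join:
SELECT m.*
FROM id_lookup AS l
JOIN main_table AS m ON m.id = l.main_id
WHERE l.token = 'ca97826d-3bea-4986-b112-782ab312aaf7';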

Related

Store and query array or group of words in MYSQL and PHP

I am working on a project that uses PHP/MySQL as the backend for an iOS app that makes heavy use of dictionaries and arrays containing text or strings.
I need to store this text in MySQL (coming from arrays of strings on the phone) and then query to see if the text contains (case-insensitively) a word or phrase in question.
For example, if the array consists of {Ford, Chevy, Toyota, BMW, Buick}, I might want to query it to see if it contains Saab.
I know storing arrays in a field is not MySQL friendly, as it prevents optimization. However, it would be way too complicated to create individual tables for these collections of words, which are created by users.
So I'm looking for a reasonable way to store them, perhaps delimited with spaces or commas, that makes reasonably efficient searches possible.
If they are stored separated by spaces, I gather you can do something with regex like:
SELECT *
FROM `wordgroups`
WHERE wordgroup REGEXP '(^|[[:space:]])BLA([[:space:]]|$)';
But this seems funky.
Is there a better way to do this? Thanks for any insights
Consider using a FULLTEXT index. And use MATCH(...) AGAINST(... IN NATURAL LANGUAGE MODE).
FULLTEXT is very fast for "words", and IN NATURAL LANGUAGE MODE may solve your Saab example.
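A minimal sketch of that approach, using the wordgroups table and column from the question (the index name is an assumption):

ALTER TABLE wordgroups ADD FULLTEXT INDEX ft_wordgroup (wordgroup);

SELECT *
FROM wordgroups
WHERE MATCH(wordgroup) AGAINST('Saab' IN NATURAL LANGUAGE MODE);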
Using regexp can achieve what you want; however, your query will be inefficient, since it cannot rely on any indexes.
If you want to store a list of words and their position within the array does not matter, then you may consider storing them in a single field, space delimited. But instead of using a regexp, use fulltext indexing and searching. This method has a clear advantage over searching with regexp: it uses an index. It has some drawbacks as well: there is a stopword list (these words are excluded from searching) and there is a minimum word length as well. The good news is that these parameters are configurable. Also, you get all the drawbacks of storing data in a delimited field, as detailed in the question "Is storing a delimited list in a database column really that bad?" here on SO.
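If you go this route, it's worth checking those limits on your server first, for example as below. Note that the token-size variables can only be set at server startup, after which the FULLTEXT indexes must be rebuilt:

SHOW VARIABLES LIKE 'innodb_ft_min_token_size';   -- default 3 (InnoDB)
SHOW VARIABLES LIKE 'ft_min_word_len';            -- default 4 (MyISAM)
SHOW VARIABLES LIKE 'innodb_ft_enable_stopword';  -- stopword filtering on/off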
However, if you want to use dictionaries (key - value pairs) or the position within the list may be important, then the above data structure will not do.
In this case, I would consider whether MySQL is the right choice for storing my data in the first place. If you have multi-dimensional lists, or lists containing lists, then I would definitely choose a NoSQL solution instead.
If you only need simple, two-dimensional lists / dictionaries, then you can store all of them in a single table with a similar structure as below:
list_id - unique identifier of the list, primary key
user_id - id of the user the list belongs to
key - for dictionaries this is the lookup field (indexed), for other lists it may store the position of the element. String data type.
value - the field holding the value (indexed). Data type should be string, so that it could hold different data types as well.
A search to determine if a list holds a certain value would be fast and efficient lookup using the index on either the key or value fields.
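A minimal sketch of that structure (column sizes and index names are assumptions; because each element is stored as its own row, the primary key here is the combination of list_id and key, and `key`/`value` are backquoted since KEY is a reserved word):

CREATE TABLE user_list_items (
    list_id BIGINT UNSIGNED NOT NULL,   -- identifier of the list
    user_id BIGINT UNSIGNED NOT NULL,   -- owner of the list
    `key`   VARCHAR(64)  NOT NULL,      -- dictionary key, or element position
    `value` VARCHAR(255) NOT NULL,      -- element value
    PRIMARY KEY (list_id, `key`),
    KEY idx_user  (user_id),
    KEY idx_value (`value`)
) ENGINE=InnoDB;

-- "Does list 7 contain 'Saab'?" becomes an indexed lookup:
SELECT 1
FROM user_list_items
WHERE list_id = 7 AND `value` = 'Saab'
LIMIT 1;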

MySQL schema for table of hashes

I need to store, query and update a large number of file hashes. What would be the optimal MySQL schema for this kind of table? Should I use a hash index, e.g.
CREATE INDEX hash_index ON Hashes(id) USING HASH;
Can I reuse the PK hash for the index? (As I understand it, USING HASH will create a hash from the hash.)
File hashes are fixed-length data items (unless you change the hash type after you have created some rows). If you represent your file hashes in hexadecimal or Base64, they'll contain letters and digits. For example, SHA-256 hashes in hex take 64 characters (four bits per character).
These characters are all 8-bit characters, so you don't need unicode. If you're careful about filling them in, you don't need case sensitivity either. Eliminating all these features of database columns makes the values slightly faster to search.
So, make your hashes fixed-length ASCII columns, using ddl like this:
hash CHAR(64) COLLATE 'ascii_bin'
You can certainly use such a column as a primary key.
Raymond correctly pointed out that MySQL doesn't offer hash indexes except for certain types of tables. That's OK: ordinary BTREE indexes work reasonably well for this kind of information.
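A minimal sketch along those lines, with a 64-character hex hash as the primary key (the file_path column and the sample hash value are assumptions for illustration):

CREATE TABLE Hashes (
    hash      CHAR(64) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    file_path VARCHAR(1024) NOT NULL,
    PRIMARY KEY (hash)
) ENGINE=InnoDB;

-- Equality lookups use the ordinary BTREE primary key:
SELECT file_path
FROM Hashes
WHERE hash = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';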

Searching in mysql table on email(unique) column

I have several questions in mind,
1) Is searching on an int key faster than searching on a string key?
The relevance of my second question depends entirely on the answer to the first.
If yes,
2) I have a table with [1-5] billion records and a unique email column. I am planning to add one more column that will store the hashcode (int) of the email (string). Whenever I want the record for a given email, I will search records by the hashcode of the email and then match the exact email.
How effective will the second approach be? Please suggest if there is any better alternative available.
A CPU can compare an integer faster than a string. Strings are stored as sequences of encoded characters in memory, so to compare two strings the program must compare them character by character before a conclusion is returned. In MySQL, if you have a UNIQUE column with a fixed-length VARCHAR, the search time will be very fast because the MySQL engine builds a tree and uses it to search for the key. Without those two, the MySQL engine must compare each email row to the search criteria. MySQL has advanced through the years, and there are lots of built-in mechanisms that can be leveraged to make database management extremely fast.
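If you do want to try the hashcode idea from the question, a generated column is one way to sketch it. The table and column names are assumptions, CRC32 is used purely as an example integer hash, and a plain UNIQUE index on the email column is usually the simpler option and already very fast:

CREATE TABLE users (
    id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email     VARCHAR(320) NOT NULL,
    email_crc INT UNSIGNED AS (CRC32(email)) STORED,  -- integer hash of the email
    KEY idx_email_crc (email_crc)
) ENGINE=InnoDB;

-- Narrow by the integer hash, then confirm the exact email (hash collisions are possible):
SELECT *
FROM users
WHERE email_crc = CRC32('someone@example.com')
  AND email = 'someone@example.com';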

Indexes on BLOBs that contain encrypted data

I have a bunch of columns in a table that are of the type BLOB. The data that's contained in these columns are encrypted with MySQL's AES_ENCRYPT() function. Some of these fields are being used in a search section of an application I'm building. Is it worth it to put indexes on the columns that are being frequently accessed? I wasn't sure if the fact that they are BLOBs or the fact that the data itself is encrypted would make an index useless.
EDIT: Here are some more details about my specific case. There is a table with ~10 columns or so that are each BLOBs. Each record that is inserted into this table will be encrypted using the AES_ENCRYPT() function. In the search portion of my application users will be able to type in their query. I take their query and decrypt the column like this: SELECT AES_DECRYPT(fname, MYSTATICKEY) AS fname FROM some_table, so that I can perform a search using a LIKE clause. What I am curious about is whether the index will index the encrypted data rather than the actual data that is returned from the decryption. I am guessing that if the index applies only to the encrypted binary string then it would not help performance at all. Am I wrong on that?
Note the following:
You can't add an index of type FULLTEXT to a BLOB column (http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html)
Therefore, you will need to use another type of index. For BLOBs, you will have to specify a prefix length (http://dev.mysql.com/doc/refman/5.0/en/create-index.html) - the length will depend on the storage engine (e.g. up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables). Therefore, unless the values you are storing are short you won't be able to index all the data.
AES_ENCRYPT() encrypts a string and returns a binary string. This binary string will be the value that is indexed.
Therefore, IMO, your guess is right - an index won't help the performance of your searches.
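To make that concrete: even if you add a prefix index on the BLOB, it indexes the ciphertext bytes, so a search on the decrypted values cannot use it. A small illustration, reusing the some_table / fname names from the question ('MYSTATICKEY' and '%smith%' are placeholders):

ALTER TABLE some_table ADD INDEX idx_fname (fname(255));  -- indexes the encrypted bytes

-- The predicate is evaluated on decrypted values, so the index above is not usable:
SELECT AES_DECRYPT(fname, 'MYSTATICKEY') AS fname
FROM some_table
WHERE AES_DECRYPT(fname, 'MYSTATICKEY') LIKE '%smith%';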
Note that 'indexing an encrypted column' is a fairly common problem - there are quite a few articles online about it. For example (although this is quite old and for MS SQL, it does cover some ideas): http://blogs.msdn.com/b/raulga/archive/2006/03/11/549754.aspx
Also see: What's the best way to store and yet still index encrypted customer data? (the top answer links to the same article I found above)

Does a fixed prepended string in a MySQL indexed table cause performance issues?

I have a MySQL InnoDB table with about one billion rows, containing a source field. All the source field values are URLs, so they all start with http:// (no https).
Does it improve SELECT performance on the source field if I remove the http:// prefix from all the values?
It depends.
I assume you have an index on your source field. Indexes on varchar fields in MySQL only work on prefixes, i.e. they can only be used when searching for either the whole value (... where source = "some value") or a substring of the value starting at position 0 (... WHERE source LIKE "some value%"). If you query for arbitrary substrings (i.e. ... WHERE source LIKE "%some value%"), MySQL cannot use the index.
When creating an index on a varchar or text column, you can optionally specify an index length (KEY indexName (source(10))). If you do, the index will only cover (in this example) the leftmost 10 characters of the URL. If you don't specify an index length, the whole field value is indexed - this makes the index larger, but more selective. (Index selectivity is the number of different values in your index divided by the total number of indexed values. The closer this ratio is to 1, the better.) If you're using a TEXT or BLOB type, an index length is required.
Now, if you have an index, have set an index length and query for a URL prefix as described above, then yes, removing "http://" from the URLs will make your index more selective and thus faster. How much faster depends on your data, the index length, and how much more selective your index becomes, so you should really measure it. I doubt, though, that it will ultimately make much difference, and if it does, you might gain much more by tinkering with the index.
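A small example of the prefix-index idea (the pages table name, the prefix length of 40, and the example URL are assumptions; measure against your own data):

ALTER TABLE pages ADD INDEX idx_source (source(40));

-- A prefix search can use the index:
SELECT * FROM pages WHERE source LIKE 'http://example.com/%';

-- A leading wildcard cannot:
SELECT * FROM pages WHERE source LIKE '%example.com%';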
If you don't query for URL prefixes or complete URLs, you might want to preprocess your URLs to be able to create an index that works with your query. If you don't have an index at all, then making an effective one should be your first optimization step.