Bitmasks vs. separate columns - MySQL

I have a database of users and their permissions. For example, a row looks like this:
Name        | sendMessages | receiveMessages | post | readPosts
------------+--------------+-----------------+------+----------
Jeff Atwood | 1            | 1               | 0    | 1
What's better for this situation, separate columns (as in the example) or a single column, containing a bitmask (in this case, 1101 translates to 0xD)?

tinyint(1) as boolean is usually the best way to go.
Doing queries with a bitmask is not efficient: the server cannot use an index when it has to compute the mask expression, and things can get very nasty if you try to make use of an index anyway.
Let's look at a simple query:
select * from tbl where sendMessages = 1 and readPosts = 1
With a single column that would be:
select * from tbl where val&9 = 9
This is not really efficient, as it has to do a full table scan and evaluate the expression for every row.
Let's try to rewrite the query so that it can make use of an index.
This can be done by listing all possible values with IN:
select * from tbl where val in (9, 11, 13, 15)
Now imagine how this query would look if you wanted a simple WHERE readPosts = 1.
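Given the bit layout implied above (sendMessages = 8, readPosts = 1), a readPosts = 1 filter would need every odd value, for example:
select * from tbl where val in (1, 3, 5, 7, 9, 11, 13, 15)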
However, if you list too many values, the MySQL optimizer will still do a full table scan.

What about not using columns for permissions, but instead creating a permissions table and a user-permissions link table?
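A minimal sketch of that design (the table and column names here are illustrative, not from the comment):

CREATE TABLE permission (
    id   INT PRIMARY KEY,
    name VARCHAR(32) NOT NULL  -- e.g. 'sendMessages'
);

CREATE TABLE user_permission (
    user_id       INT NOT NULL,
    permission_id INT NOT NULL,
    PRIMARY KEY (user_id, permission_id)
);

-- does user 1 have the 'post' permission?
SELECT 1
FROM user_permission up
JOIN permission p ON p.id = up.permission_id
WHERE up.user_id = 1 AND p.name = 'post';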

It is best to store permissions as separate columns with the BIT data type. Most modern database engines optimize bit column storage:
for up to 8 bit columns in a table, the columns are stored in a single byte;
for 9 to 16 bit columns, the columns are stored in 2 bytes;
and so on.
So it gives you the best of the two options you list: the reduced storage of a bitmask with the clarity of multiple columns. The engine takes care of it for you.
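For example (a sketch; note that the byte-packing described above is engine-specific, and whether adjacent BIT columns share bytes varies by engine):

CREATE TABLE user_permissions (
    name            VARCHAR(64) NOT NULL,
    sendMessages    BIT(1) NOT NULL DEFAULT b'0',
    receiveMessages BIT(1) NOT NULL DEFAULT b'0',
    post            BIT(1) NOT NULL DEFAULT b'0',
    readPosts       BIT(1) NOT NULL DEFAULT b'0'
);

-- each flag is still addressable by name:
SELECT name FROM user_permissions WHERE sendMessages = b'1' AND readPosts = b'1';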

With a bit mask you may not be able to use every bit of the field (the sign bit, bit 31 in a 32-bit integer or bit 63 in 64-bit, is risky in signed types), but any usable bit is easily searchable with pow(2, bit) & field = pow(2, bit).
If you need more bits than one field can hold, you have to start using multiple fields, and then you get into the pain of working out which field your bit is set in.
This can easily be overcome by a simple routine that returns true or false given the bit you are looking for and the fields of the row in question.
But for permissions it would be best, as @CodeCaster said, to use a permissions table and a linking table.

Related

Can I leverage indexing on a column that has a variable value added to it?

I have a table with data like this:
id  | parent | sequence   | mz1_monoisotopic
----+--------+------------+-----------------
90  | 0      | SGVNHR     | 669.34
93  | 0      | IEEIATDLK  | 1031.56
95  | 1      | MDLILDDR   | 990.49
100 | 1      | LEVSEELIEK | 1188.64
My application treats this as a 'base table' from which new tables are generated by adding some value (defined by the user) to the column mz1_monoisotopic. For example:
SELECT * FROM my_table WHERE (mz1_monoisotopic + 55) BETWEEN 1000 AND 1002
Since most of the lookups are done via the mz1_monoisotopic column, and many of my tables have millions upon millions of entries, I really need to increase search speed. Is it possible to leverage indexing in this situation, or am I out of luck? I learned about functional indexes, but the problem is that the added value can change.
Additionally, the index itself takes too long to create on a new table, so it almost seems like my only option is a full table scan.
Yes, an ordinary index works fine. You don't even need an expression index.
Behold! The power of algebra!
SELECT * FROM my_table WHERE mz1_monoisotopic BETWEEN 1000 - 55 and 1002 - 55;
The value can change from query to query, and the index will still work fine. As long as the expressions on the right side evaluate to constant values, they can be used for an indexed lookup.
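In application code the user-defined offset simply becomes a parameter on the right-hand side. A sketch with an ordinary index and a prepared statement (the index and variable names are illustrative):

CREATE INDEX idx_mz1 ON my_table (mz1_monoisotopic);

PREPARE stmt FROM
  'SELECT * FROM my_table WHERE mz1_monoisotopic BETWEEN 1000 - ? AND 1002 - ?';
SET @offset := 55;
EXECUTE stmt USING @offset, @offset;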

Is there a way to index on multiple values in a column in MySQL

Currently I have a column in my table which has a set of comma separated values. I am currently using it to filter the results. I am wondering if it would be possible to index on it and query directly using it.
My table is as below:
userId | types
-------+---------
123    | A, B, C
234    | B, C
If I query for the users that have types A and C, I should get 123.
With B and C, I should get 123 and 234.
EDIT: I am aware the problem can be solved by normalization. However, my table is actually storing JSON, and this field is a virtual column referencing a list; there are no relations used anywhere. We are facing an issue where querying by types was not considered up front and is now causing a performance impact.
First of all, you should normalize your table and remove the CSV data. Use something like this:
userId | types
123 | A
123 | B
123 | C
234 | B
234 | C
For the specific query you have in mind, you might choose:
SELECT userId
FROM yourTable
WHERE types IN ('A', 'C')
GROUP BY userId
HAVING MIN(types) <> MAX(types);
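The MIN/MAX trick works because the IN list has exactly two values: if both appear for a userId, the minimum and maximum must differ. For more than two required types, the usual generalization (a sketch, not part of the original answer) is to count distinct matches:

SELECT userId
FROM yourTable
WHERE types IN ('A', 'B', 'C')
GROUP BY userId
HAVING COUNT(DISTINCT types) = 3;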
With this in mind, MySQL might be able to use the following composite index:
CREATE INDEX idx ON yourTable (userId, types);
In fact, this index should cover the entire query above.
The answer is basically no . . . but not really. The important point is that you should restructure the data and store it properly. And "properly" means that string columns are not used to store multiple values.
However, that is not your question. You can create an index to do what you want: a full-text index, which allows you to use MATCH(). If you take this approach, you need to be very careful:
You need to use boolean mode when querying.
You need to set the minimum word length so single characters are recognized as words.
You need to check the stop-words list, so that words such as "a" and "i" are not filtered out.
So, what you want to do is possible. However, it is not recommended, because the data is not in a proper relational format.
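A minimal sketch of that approach, assuming the table is yourTable(userId, types) and that you can change server options (the minimum token length is innodb_ft_min_token_size for InnoDB and ft_min_word_len for MyISAM; changing either requires rebuilding the index):

-- in my.cnf: innodb_ft_min_token_size = 1, plus an empty or custom stopword list
ALTER TABLE yourTable ADD FULLTEXT INDEX ft_types (types);

SELECT userId
FROM yourTable
WHERE MATCH(types) AGAINST('+A +C' IN BOOLEAN MODE);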
MySQL supports Multi-Value Indexes for JSON columns as of MySQL 8.0.17.
It seems like exactly your case.
Details: https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-multi-valued
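A sketch of what that looks like here (the column and index names are illustrative), with the types stored as a JSON array:

CREATE TABLE userTypes (
    userId INT PRIMARY KEY,
    types  JSON,
    INDEX idx_types ((CAST(types AS CHAR(4) ARRAY)))
);

-- users having both A and C; JSON_CONTAINS can use the multi-valued index
SELECT userId
FROM userTypes
WHERE JSON_CONTAINS(types, CAST('["A", "C"]' AS JSON));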

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a field that stores from 3 up to 16 different "combinations" (varchar). A combination is a 4-digit number, a colon, and a char (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (there can be up to 16). This field is present in every row of the table, never null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to happen is this: when you send '0001:A/0002:A/0003:C' via a program (or as a parameter in SQL), it has to check every row and see which rows share the same "group". The catch is that it has to go both ways, it has to be done quickly, and SQL needs to do the comparison somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A', it has to find all rows that share 3-16 of the same combinations and return them. This 3-16 can be specified via a parameter, but the problem is that you would need to handle all possible combinations, because you could send '0002:A/0711:C/0713:A', and as you can see 0002:A can be the first parameter.
But you cannot use simple indexing, because a combination can be at any position in the string, and you can send combinations that are not adjacent (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all rows that share the same 3-16 combinations,
and it has to go both ways: if you send '0001:A/0002:A/0003:C', it has to find the row above plus all similar rows (all that contain all the parameters).
Some things/options I tried:
Doing a LIKE for all sent combinations is not practical, and too slow.
Giving the field a full-text index isn't an option (I don't know why exactly).
One of the few things that could work would be making some "hash"-type encoding for the field, calculating it via a program, and searching for all matching "hashes" (I don't know how you would do that, given that a generic hash produces completely different values for similar texts; maybe some hash written specifically for this).
Making a new field, calculating/writing all possible combinations on insert, and checking via SQL/program whether they share a percentage of combinations; but I don't know how you could store 10080 combinations (in the case of 16) in a "varchar" effectively, or via some hash code, and still know which of them are related.
There is another catch: this table is in use almost 24/7, and comparing combinations in SQL is too slow because the table is too big. It could be done via a program, but I don't have any clue how you could store this in a new column in a way that lets you know the rows are the same. One possibility would be to calculate the combinations on each row insert, store them via some hash code, calculate the "hash" of the input via a program, and check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to execute really fast, in under 1 minute, because there could be new inserts into the table that you would need to check.
The whole point of this is to see whether any similar combination already exists in SQL, and to block any new combination that would be "similar" from being inserted.
I have been dealing with this problem for 3 days now without finding any possible solution; the closest thing was some kind of insert/hash scheme, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO, if the group is a basic element of your data structure, your database structure is flawed: to be normalized, it should have each group in its own cell. The structure you describe makes it clear that you are storing a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
  ID BIGINT PRIMARY KEY,
  DESCRIPTION TEXT
);

CREATE TABLE GRP (
  ID BIGINT PRIMARY KEY,
  GROUP_TXT CHAR(6)
);

CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
  GROUP_ID BIGINT,
  GROUP_SEQUENCE_HEADER_ID BIGINT,
  GROUP_SEQUENCE_HEADER_ORDER INT, /* for storing the order in the sequence */
  PRIMARY KEY (GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 -- lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
EDIT
Rethinking it a bit more: you could even break each group up into its two parts, but as I understand it you always deal with full groups, so that doesn't seem necessary.
EDIT2: If you want to speed the process up even more, I'd recommend translating the sequences into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it 4 bits to the left (multiply by 16, but quicker), and add the value of the letter in the last place (A = 10 through E = 14).
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
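The encoding can even be computed directly in SQL, e.g. (a sketch; it assumes the token is exactly in the 'NNNN:X' format):

SELECT CAST(SUBSTRING('0343:D', 1, 4) AS UNSIGNED) * 16
     + ASCII(SUBSTRING('0343:D', 6, 1)) - ASCII('A') + 10 AS encoded;
-- yields 5501, matching the example above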
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns; you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a SELECT joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
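A sketch of that join, assuming the exploded table is tokens(row_id, token) and the tokenized search string has been loaded into search_tokens(token) (both names are illustrative):

SELECT t.row_id, COUNT(*) AS matched
FROM tokens t
JOIN search_tokens s ON s.token = t.token
GROUP BY t.row_id
HAVING COUNT(*) >= 3;  -- minimum number of shared tokens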
As people have commented, you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key, exploding your column on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
    t1.key
  , t2.key
  , COUNT(*)
FROM
    temp_table t1
  , temp_table t2
  , ( SELECT t3.key, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.key ) t4
WHERE
    t1.combination IN (
        SELECT
            t5.combination
        FROM
            temp_table t5
        WHERE
            t5.key = t2.key )
AND t1.key <> t2.key
AND t4.key = t2.key
GROUP BY
    t1.key, t2.key, t4.cnt
HAVING
    COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
I guess I can recommend building a special "index".
It will be quite big, but you will achieve super-speedy results.
Let's consider this task as searching for a set of symbols.
The design constraints are these:
the symbols follow the pattern "NNNN:X", where NNNN is a number [0001-9999] and X is a letter [A-E],
so we have 5 * 9999 = 49995 symbols in the alphabet,
and the maximum length of a word over this alphabet is 16.
We can build, for each word, the set of combinations of its symbols.
For example, the word "abcd" will have these combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As the symbols within a word are sorted, we have only 2^N - 1 combinations (15 for 4 symbols).
For 16-symbols word there are 2^16 - 1 = 65535 combinations.
So we build an additional index-organized table like this one:
create table spec_ndx(combination varchar(100), original_value varchar(100))
Performance will be excellent, at the price of overhead: in the worst case, for each record in the original table there will be 65535 "index" records.
So for a 100-million-row table we would get a table with roughly 6.5 trillion rows.
But if the values are short, the size of the "special index" shrinks drastically.
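Lookups against the "special index" are then a single indexed equality, e.g. (a sketch; the symbols in the search string must be sorted with the same rule used when populating the table):

create index ndx_comb on spec_ndx (combination);

select original_value
from spec_ndx
where combination = '0001:A/0002:A/0003:C';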

Binary string search on one field

I have 300 boolean fields in one table, and I'm trying to do something like this:
One string field:
10000010000100100100100100010001
What would be a simple way to do a search on this field, something like:
select * from table where field xor "10000010000100100100000000010001"
I'm trying this, but it is too long:
select * from test where mid(info,2,1) and mid(info,3,1)
:) Help!!
A citation from the book High Performance MySQL:
If you used an integer, you could write that example as follows:
mysql> SET @CAN_READ := 1 << 0,
    ->     @CAN_WRITE := 1 << 1,
    ->     @CAN_DELETE := 1 << 2;
mysql> CREATE TABLE acl (
    ->     perms TINYINT UNSIGNED NOT NULL DEFAULT 0
    -> );
mysql> INSERT INTO acl(perms) VALUES(@CAN_READ + @CAN_DELETE);
mysql> SELECT perms FROM acl WHERE perms & @CAN_READ;
+-------+
| perms |
+-------+
|     5 |
+-------+
UPD:
A possible solution in your case, if all the strings are the same length (I'll be surprised if they are not):
select * from test where info like '_______00001001001001001___1___1';
The best way to handle this, if you can, would be to create another table linked to the existing one as many-to-one; you could then use a SELECT to find all records in the subtable matching the ID of your parent table. In this example, the new table is named info (after the column) and the prior table is named parent:
SELECT parent.*
FROM info
INNER JOIN parent
ON parent.id = info.parent_id
WHERE info.data IN ( 2, 3 ) -- see note 1
GROUP BY parent.id
HAVING COUNT(*) = 2 -- see note 2
Note 1: The bit positions within the string are now stored as ints in the new table, in the column 'data'.
Note 2: You will need to specify the number of values listed in the IN clause above.
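The new table might be declared like this (a sketch; the names follow the answer's description):

CREATE TABLE info (
    parent_id INT NOT NULL,
    data      INT NOT NULL,  -- the position of a bit that is set
    PRIMARY KEY (parent_id, data)
);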
Thoughts: This query does not eliminate parent records where additional values match in the data table.
Applying bitmasking to types other than ints, such as strings in your case, is possible if you write your own external function and do the comparison there, but this is somewhat hardcore stuff if you are not familiar with programming in C. MySQL itself only provides bitwise operators for the integer types it uses internally.
Edit: Or use the 'LIKE' solution provided by newtower
See other similar topics: Is there a practical limit to the size of bit masks?
While efficient in space and possibly speed, this approach has the same disadvantages as a table with 300 columns: it is very inflexible, and adding or removing values requires altering the structure of the table rather than the data.
While compressing everything into one field might seem to solve the problem, it actually stores the same structure in an even more inflexible format, because you remove the column:data semantics and rely solely on position. Making changes to this kind of storage format will eventually be very time-intensive and error-prone to code.
It seems to me that you would do better if you inverted your problem and made a table that contains your booleans as rows, instead of having lots of columns. This is often the case. :)
Go with the table structure presented by JYelton instead, or something similar, if possible.

How to speed up SELECT .. LIKE queries in MySQL on multiple columns?

I have a MySQL table for which I do very frequent SELECT x, y, z FROM table WHERE x LIKE '%text%' OR y LIKE '%text%' OR z LIKE '%text%' queries. Would any kind of index help speed things up?
There are a few million records in the table. If there is anything that would speed up the search, would it seriously impact disk usage by the database files and the speed of INSERT and DELETE statements? (no UPDATE is ever performed)
Update: Quickly after posting, I have seen a lot of information and discussion about the way LIKE is used in the query; I would like to point out that the solution must use LIKE '%text%' (that is, the text I am looking for is prepended and appended with a % wildcard). The database also has to be local, for many reasons, including security.
An index wouldn't speed up the query, because indexes on textual columns work by indexing N characters starting from the left. With LIKE '%text%', the index can't be used, because there can be any number of characters before text.
What you should be doing is not using a query like that at all. Instead, use something like the full-text search (FTS) that MySQL supports for MyISAM tables. It's also pretty easy to build such an indexing system yourself for non-MyISAM tables: you just need a separate index table where you store words and the IDs of the matching rows in the actual table.
Update
Full-text search is available for InnoDB tables as of MySQL 5.6.
An index won't help text matching with a leading wildcard; an index can be used for:
LIKE 'text%'
But I'm guessing that won't cut it. For this type of query you really should be looking at a full-text search provider if you want to scale the number of records you can search across. My preferred provider is Sphinx, which is very full-featured, fast, etc.; Lucene might also be worth a look. A FULLTEXT index on a MyISAM table will also work, but ultimately pursuing MyISAM for any database with a significant number of writes isn't a good idea.
An index cannot be used to speed up queries where the search criterion starts with a wildcard:
LIKE '%text%'
An index can (and might, depending on selectivity) be used for search terms of the form:
LIKE 'text%'
Add a Full Text Index and Use MATCH() AGAINST().
Normal indexes will not help you with like queries, especially those that utilize wildcards on both sides of the search term.
What you can do is add a full text index on the columns that you're interested in searching and then use a MATCH() AGAINST() query to search those full text indexes.
Add a full text index on the columns that you need:
ALTER TABLE table ADD FULLTEXT INDEX index_table_on_x_y_z (x, y, z);
Then query those columns:
SELECT * FROM table WHERE MATCH(x,y,z) AGAINST("text")
From our trials, we found these queries to take around 1ms in a table with over 1 million records. Not bad, especially compared to the equivalent wildcard LIKE %text% query which takes 16,400ms.
Benchmarks
MATCH(x,y,z) AGAINST("text") takes 1ms
LIKE %text% takes 16400ms
16400x faster!
I would add that in some cases you can speed up the query by using an index together with LIKE/RLIKE, if the field you are searching is often empty or contains a constant value.
In that case you can limit the rows that are visited by adding an AND clause with the fixed value, so the index narrows the scan.
I tried this for searching 'tags' in a huge table which usually does not contain a lot of tags.
SELECT * FROM objects WHERE tags RLIKE '(^|,)tag(,|$)' AND tags != '';
If you have an index on tags, you will see that it is used to limit the rows that are searched.
Maybe you can try upgrading from MySQL 5.1 to MySQL 5.7.
I have about 70,000 records and ran the following SQL:
select * from comics where name like '%test%';
It takes 2000 ms on MySQL 5.1,
and 200 ms on MySQL 5.7 or 5.6.
Another way:
You can maintain calculated columns with those strings REVERSEd and use:
SELECT x, y, z FROM table WHERE x LIKE 'text%' OR y LIKE 'text%' OR z LIKE 'text%' OR xRev LIKE 'txet%' OR yRev LIKE 'txet%' OR zRev LIKE 'txet%'
Example of how to add a stored generated column:
ALTER TABLE table ADD COLUMN xRev VARCHAR(N) GENERATED ALWAYS AS (REVERSE(x)) STORED;
and then create indexes on xRev, yRev, etc.
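For instance (a sketch; the index names are illustrative):

CREATE INDEX idx_xRev ON table (xRev);
CREATE INDEX idx_yRev ON table (yRev);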
Another alternative to avoid full table scans is selecting substrings and checking them in the HAVING clause:
SELECT
al3.article_number,
SUBSTR(al3.article_number, 2, 3) AS art_nr_substr,
SUBSTR(al3.article_number, 1, 3) AS art_nr_substr2,
al1.*
FROM
t1 al1
INNER JOIN t2 al2 ON al2.t1_id = al1.id
INNER JOIN t3 al3 ON al3.id = al2.t3_id
WHERE
al1.created_at > '2018-05-29'
HAVING
(art_nr_substr = 'FLA' OR art_nr_substr = 'VKV' OR art_nr_substr2 = 'PBR');
When you optimize a SELECT foo FROM bar WHERE baz LIKE 'ZOT%' query, you want the index length to at least match the number of characters in the search term.
Here is a real life example from just now:
Here is the query:
EXPLAIN SELECT COUNT(*) FROM client_detail cd
JOIN client_account ca ON cd.client_acct_id = ca.client_acct_id
WHERE cd.first_name LIKE 'XX%' AND cd.last_name_index LIKE 'YY%';
With no index:
+-------+
| rows |
+-------+
| 13994 |
| 1 |
+-------+
So first, try a 4x4 index:
CREATE INDEX idx_last_first_4x4 on client_detail(last_name_index(4), first_name(4));
+------+
| rows |
+------+
| 7035 |
| 1 |
+------+
A bit better, but COUNT(*) shows there are only 102 results. So let's now add a 2x2 index:
CREATE INDEX idx_last_first_2x2 on client_detail(last_name_index(2), first_name(2));
yields:
+------+
| rows |
+------+
| 102 |
| 1 |
+------+
Both indexes are still in place at this point, and MySQL chose the latter index for this query; however, it will still choose the 4x4 index if that is more efficient.
Index ordering may matter: try the 2x2 before the 4x4, or vice versa, to see how it performs in your environment. To re-order an index you have to drop and re-create the earlier one.
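For example (a sketch of the drop-and-re-create step):

ALTER TABLE client_detail DROP INDEX idx_last_first_2x2;
CREATE INDEX idx_last_first_2x2 ON client_detail(last_name_index(2), first_name(2));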