MySQL index on integer bits

I have a MySQL table like:
id (UNSIGNED INT) PrimaryKey AutoIncrement
name (VARCHAR(10))
status (UNSIGNED INT) Indexed
I use the status column to represent 32 different statuses like:
0 -> open
1 -> deleted
...
31 -> something
This is convenient since I do not know in advance how many statuses there will be (we currently support 32; a BIGINT would give us 64, and more than 64 is highly unlikely :) ).
The problem with this approach is that there is no index at the bit level, so queries selecting rows where a given bit is set are slow.
I can improve things a bit using range queries (where status between n1 and n2), but this is still not a good approach.
I want to point out that I only ever search on a few of the 32 bits (say bits 0, 12, 13, 21, 31).
Any ideas to improve performance?

If for some reason you cannot normalize your data as suggested by RandomSeed in the other answer, I'm pretty sure you can just put an index on the field and search using integer values (that is, sums of powers of two).
For example, if you need exactly bits 0, 12 and 13 set (and all others clear), search where status = 2^0 + 2^12 + 2^13 = 12289. (Note that in MySQL ^ is the bitwise XOR operator, not exponentiation, so write the literal value or use POW().)
Edit: If you need to search where those bits are set regardless of the other bits, you can use bitwise operators, e.g. for bits 0, 12 and 13, search where status & 1 = 1 and status & 4096 = 4096 and status & 8192 = 8192.
However, compared to a range query I'm not sure what the performance improvement will be (if any), since a B-tree index on status cannot satisfy bitwise predicates. So, as said before, normalization might be the only solution.
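A minimal sketch of both variants, assuming the table from the question is named entities (the combined mask 12289 = 1 + 4096 + 8192 covers bits 0, 12 and 13):
-- Exact match: bits 0, 12 and 13 set, all other bits clear.
-- This predicate is sargable, so the index on status can be used.
SELECT id, name FROM entities WHERE status = 12289;

-- Bits 0, 12 and 13 set, other bits ignored. A single combined mask
-- replaces the three ANDed conditions, but still forces a scan.
SELECT id, name FROM entities WHERE (status & 12289) = 12289;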

Normalize your data.
MainEntity:
id (UNSIGNED INT) PrimaryKey AutoIncrement
name (VARCHAR(10))
Status:
id (UNSIGNED INT) PrimaryKey AutoIncrement
label (VARCHAR(10))
EntityHasStatus:
entity_id (UNSIGNED INT) PrimaryKey
status_id (UNSIGNED INT) PrimaryKey
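A sketch of the junction table's DDL under this design (the two PrimaryKey columns form one composite key; the secondary index, whose name is illustrative, speeds up lookups that filter by status first):
CREATE TABLE EntityHasStatus (
    entity_id INT UNSIGNED NOT NULL,
    status_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (entity_id, status_id),
    KEY idx_status_entity (status_id, entity_id)
);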
Entities having both statuses 1 and 5:
SELECT MainEntity.*
FROM MainEntity
JOIN EntityHasStatus AS Status1
    ON Status1.entity_id = MainEntity.id
    AND Status1.status_id = 1
JOIN EntityHasStatus AS Status5
    ON Status5.entity_id = MainEntity.id
    AND Status5.status_id = 5
Entities having either status 4 or 6:
SELECT MainEntity.*
FROM MainEntity
LEFT JOIN EntityHasStatus AS Status4
    ON Status4.entity_id = MainEntity.id
    AND Status4.status_id = 4
LEFT JOIN EntityHasStatus AS Status6
    ON Status6.entity_id = MainEntity.id
    AND Status6.status_id = 6
WHERE
Status4.status_id IS NOT NULL
OR Status6.status_id IS NOT NULL
These queries should be virtually instant (prefer the first form when possible, as it is a tad more efficient).


How to speed up a query containing HAVING?

I have a table with close to a billion records, and need to query it with HAVING. It's very slow (about 15 minutes on decent hardware). How to speed it up?
SELECT ((mean - 3.0E-4)/(stddev/sqrt(N))) as t, ttest.strategyid, mean, stddev, N,
kurtosis, strategies.strategyId
FROM ttest,strategies
WHERE ttest.strategyid=strategies.id AND dataset=3 AND patternclassid="1"
AND exitclassid="1" AND N>= 300 HAVING t>=1.8
I think the problem is t cannot be indexed because it needs to be computed. I cannot add it as a column because the '3.0E-4' will vary per query.
Table:
create table ttest (
strategyid bigint,
patternclassid integer not null,
exitclassid integer not null,
dataset integer not null,
N integer,
mean double,
stddev double,
skewness double,
kurtosis double,
primary key (strategyid, dataset)
);
create index ti3 on ttest (mean);
create index ti4 on ttest (dataset,patternclassid,exitclassid,N);
create table strategies (
id bigint ,
strategyId varchar(500),
primary key(id),
unique key(strategyId)
);
explain select ...:
id | select_type | table      | partitions | type   | possible_keys | key     | key_len | ref                             | rows    | filtered | Extra
1  | SIMPLE      | ttest      | NULL       | range  | PRIMARY,ti4   | ti4     | 17      | NULL                            | 1910344 | 100.00   | Using index condition; Using MRR
1  | SIMPLE      | strategies | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | Jellyfish_test.ttest.strategyid | 1       | 100.00   | Using where
The query needs to be reformulated and an index needs to be added.
Plan A:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
s.strategyId
FROM ttest AS tt
JOIN strategies AS s ON tt.strategyid = s.id
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
and a 'composite' and 'covering' index on ttest. Replace your ti4 with this (to make it 'covering'):
INDEX(dataset, patternclassid, exitclassid, -- any order
N, strategyid) -- in this order
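Concretely, replacing ti4 might look like this (one ALTER TABLE swaps the old index for the new one):
ALTER TABLE ttest
    DROP INDEX ti4,
    ADD INDEX ti4 (dataset, patternclassid, exitclassid, N, strategyid);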
Plan B:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
( SELECT s.strategyId
FROM strategies AS s
WHERE s.id = tt.strategyid
) AS strategyId
FROM ttest AS tt
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
With the same index.
Unfortunately the expression for t needs to be repeated. Moving it from HAVING to WHERE avoids gathering unwanted rows only to throw them away; maybe the optimizer does that automatically. Please provide EXPLAIN SELECT ... so we can see.
Also, it is unclear whether one of the two formulations will run faster than the other.
To be honest, I've never seen HAVING used like this; for 20+ years I've assumed it could only be used with GROUP BY!
Anyway, IMHO you don't need it here; as Rick James points out, you can put it all in the WHERE.
Rewriting it a bit I end up with:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
    ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
For most of that we can indeed foresee a reasonable index. The problem remains with the last calculation:
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
However, before we go there: how many rows are there if you ignore this 'formula'? 100? 200? If so, indexing as foreseen in Rick James' answer should be sufficient IMHO.
If it's in the thousands or many more, then the question becomes: how many of those are thrown out by the formula? 1%? 50%? 99%? If it's on the low side then again, indexing as proposed by Rick James will do. If however you only need to keep a few, you may want to optimize further and index accordingly.
From your explanation I understand that 3.0E-4 is variable, so we can't include it in the index; we'll need to extract the parts we can.
If my algebra isn't failing me, you can rearrange the formula like this (multiplying both sides by t.stddev / sqrt(t.N) preserves the inequality direction because stddev and N are positive):
AND ((t.mean - 3.0E-4) / (t.stddev / sqrt(t.N))) >= 1.8
AND (t.mean - 3.0E-4) >= 1.8 * (t.stddev / sqrt(t.N))
AND -3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N))) - t.mean
So the query becomes:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
    ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND (1.8 * (t.stddev / sqrt(t.N))) - t.mean <= -3.0E-4
I'm not familiar with MySQL, but glancing at the documentation it should be possible to include 'generated columns' in an index. So we'll do exactly that with (1.8 * (t.stddev / sqrt(t.N))) - t.mean.
Your indexed fields thus become:
dataset, patternclassid, exitclassid, N, (1.8 * (stddev / sqrt(N))) - mean
Note that the system will have to calculate this value for each and every row you insert (and possibly update) in the table. However, once it is there (and indexed) it should make the query quite a bit faster.
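A minimal sketch of that idea, assuming MySQL 5.7+ (which supports indexing generated columns); the column and index names are illustrative, and note the factor 1.8 is baked into the expression, so this only helps if that factor does not vary per query:
ALTER TABLE ttest
    ADD COLUMN t_bound DOUBLE
        GENERATED ALWAYS AS (1.8 * (stddev / SQRT(N)) - mean) STORED,
    ADD INDEX ti5 (dataset, patternclassid, exitclassid, t_bound);

-- The filter then becomes a plain range condition on the indexed column:
-- ... AND t_bound <= -3.0E-4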

How can I make my select statement deterministically match only 1/n of my dataset?

I'm processing data from a MySQL table where each row has a UUID associated with it. EDIT: the "UUID" is in fact an MD5 hash (VARCHAR) of the job text.
My select query looks something like:
SELECT * FROM jobs ORDER BY priority DESC LIMIT 1
I am only running one worker node right now, but would like to scale it out to several nodes without altering my schema.
The issue is that the jobs take some time, and scaling out beyond one right now would introduce a race condition where several nodes are working on the same job before it completes and the row is updated.
Is there an elegant way to effectively "shard" the data on the client-side, by specifying some modifier config value per worker node? My first thought was to use the MOD function like this:
SELECT * FROM jobs WHERE UUID MOD 2 = 0 ORDER BY priority DESC LIMIT 1
and SELECT * FROM jobs WHERE UUID MOD 2 = 1 ORDER BY priority DESC LIMIT 1
In this case I would have two workers configured as "0" and "1". But this isn't giving me an even distribution (not sure why) and feels clunky. Is there a better way?
The problem is you're storing the ID as a hex string like acbd18db4cc2f85cedef654fccc4a4d8. MySQL will not convert the hex for you. Instead, if the string starts with a letter you get 0; if it starts with digits, you get those leading digits.
select '123abc' + 0;  -- 123
select 'abc123' + 0;  -- 0
6 out of 16 hex strings start with a letter, so they all convert to 0, and 0 mod anything is 0. The remaining 10 of 16 start with a digit and are distributed properly: 5 of 16 will be even and 5 of 16 odd. 6/16 + 5/16 ≈ 69% will be 0, which is very close to your observed 72%.
To do this right we need to convert the 128-bit hex string into a 64-bit unsigned integer:
Slice off 64 bits with either left(uuid, 16) or right(uuid, 16).
Convert the hex (base 16) into decimal (base 10) using conv().
Cast the result to an unsigned bigint. If we skip this step, MySQL appears to use a float, which loses accuracy.
select cast(conv(right(uuid, 16), 16, 10) as unsigned) mod 2
Beautiful.
That will only use 64 bits of the 128-bit checksum, but for this purpose that should be fine.
Note this technique works with an MD5 checksum because it is pseudorandom. It will not work with the default MySQL uuid() function, which produces a version 1 UUID: UUIDv1 is a timestamp plus a fixed node ID, so it will always mod the same way.
UUIDv4, which is a random number, will work.
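Putting it together, each worker's query might look like this (a sketch assuming the jobs table and columns from the question; the worker configured as "1" compares against 1 instead):
-- worker "0" of 2: only claims jobs whose hashed key is even
SELECT *
FROM jobs
WHERE CAST(CONV(RIGHT(uuid, 16), 16, 10) AS UNSIGNED) MOD 2 = 0
ORDER BY priority DESC
LIMIT 1;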
Convert the hex string to decimal before modding:
where CONV(substring(uuid, 1, 8), 16, 10) mod 2 = 1
A reasonable hashing function should distribute evenly enough for this purpose.
Use substring to convert only a small part so the conv() result doesn't overflow the integer range and possibly behave badly. Any subset of the bits should be just as well distributed.

How to write sql query where the WHERE clause uses a substring casted to a long?

In a table called "accounts" there is an account id that is a 13-character string, where the first 8 digits are the user id who owns that account. How do I query the database with an integer and check only the first 8 characters?
I was trying to do something like this:
SELECT * FROM networthr.accounts WHERE CAST(SUBSTRING(account_id, 0, 8) as long) = 1;
But it won't even let me run this query.
There are 2 problems with your query:
1) The 2nd argument of SUBSTRING() should be 1 (the index is 1-based, not 0-based).
2) You should cast to the data type UNSIGNED:
SELECT * FROM networthr.accounts WHERE CAST(SUBSTRING(account_id, 1, 8) as unsigned ) = 1;
This looks like a bad design. However, if the account_id is zero-padded like "00000001ABCDE" and you have an index on it, an efficient way would be:
SELECT *
FROM networthr.accounts
WHERE account_id LIKE CONCAT(LPAD(?, 8, 0), '%')
Replace ? with the user_id, or use it as a prepared statement and bind user_id as a parameter.
In case of user_id = 1 it's the same as
WHERE account_id LIKE '00000001%'
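For illustration, the prepared-statement form might look like this (user id 1 assumed):
SET @user_id = 1;
PREPARE find_accounts FROM
    'SELECT * FROM networthr.accounts
     WHERE account_id LIKE CONCAT(LPAD(?, 8, 0), ''%'')';
EXECUTE find_accounts USING @user_id;
Because the pattern has a fixed prefix (no leading wildcard), MySQL can use the index on account_id as a range scan.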
You can use implicit conversion:
WHERE LEFT(account_id, 8) + 0 = 1
However, you should really be comparing strings to strings.

How to index a wide table of booleans

My question is on building indexes when your client is using a lot of little fields.
Consider a search of the following (can't change it, this is what the client is providing):
SKU zone1 zone2 zone3 zone4 zone5 zone6 zone7 zone8 zone9 zone10 zone11
A123 1 1 1 1 1 1 1 1
B234 1 1 1 1 1 1 1
C345 1 1 1 1 1 1
But it is much wider, and there are many more categories than just Zone.
The user will be looking for SKUs that match at least one of the selected zones. If the user checked "zone2, zone4, zone6", I intend to query this with:
select SKU from TABLE1 where (1 IN (zone2,zone4,zone6))
Is there any advantage to indexing with a multi-tiered index like so:
create index zones on table1 (zone1,zone2,zone3,zone4,zone5,zone6,zone7,zone8,zone9,zone10,zone11)
Or will that only be beneficial when the user checked zone1?
Thanks,
Rob
You should structure the data as:
create table SKUZones (
    SKU varchar(10) not null,
    zone varchar(255) not null
);
It would be populated with one row for each place where a SKU has a 1. Queries can then take great advantage of an index on SKUZones(zone). A query such as:
select SKU
from SKUZones
where zone in ('zone2', 'zone4', 'zone6');
will readily take advantage of an index. However, if the data is not structured in a way appropriate for a relational database, then it is much harder to make queries efficient.
One approach you could take if you can add a column to the table is the following:
Add a new column called zones or something similar.
Use a trigger to populate it with a label for each "1" in the columns (so "zone3 zone4 zone5 ..." for the first row in your data).
Build a full text index on the column.
Run your query using MATCH ... AGAINST (see the sketch after this list).
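A minimal sketch of those four steps, assuming MySQL 5.6+ (for InnoDB full-text support) and the table name table1; the trigger body is abbreviated to three zones, with zone4 through zone11 following the same pattern:
-- 1. New column holding the set zones as text labels.
ALTER TABLE table1 ADD COLUMN zones TEXT;

-- 2. Keep it in sync on insert (an analogous BEFORE UPDATE trigger is also needed).
CREATE TRIGGER table1_zones_bi BEFORE INSERT ON table1
FOR EACH ROW
    SET NEW.zones = CONCAT_WS(' ',
        IF(NEW.zone1 = 1, 'zone1', NULL),
        IF(NEW.zone2 = 1, 'zone2', NULL),
        IF(NEW.zone3 = 1, 'zone3', NULL));  -- ...repeat for zone4..zone11

-- 3. Full-text index on the new column.
ALTER TABLE table1 ADD FULLTEXT INDEX ft_zones (zones);

-- 4. In boolean mode, a bare word list matches rows containing any of the words.
SELECT SKU FROM table1
WHERE MATCH(zones) AGAINST('zone2 zone4 zone6' IN BOOLEAN MODE);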
Indexing boolean values is almost always useless.
What if you use a SET datatype? Or BIGINT UNSIGNED?
Let's talk through how to do it with a suitably sized INT column named zones:
zone1 is the bottom bit (1<<0 = 1)
zone2 is the next bit (1<<1 = 2)
zone3 is the next bit (1<<2 = 4)
zone4 is the next bit (1<<3 = 8)
etc.
where (1 IN (zone2,zone4,zone6)) becomes
where (zones & 42) != 0.
To check for all 3 zones being set: where (zones & 42) = 42.
As for indexing, no index will help this design; there will still be a table scan.
If there are 11 zones, then SMALLINT UNSIGNED (2 bytes) will suffice. This will be considerably more compact than other designs, hence possibly faster.
For this query, you can have a "covering" index, which helps some:
select SKU from TABLE1 where (zones & 42) != 0
together with INDEX(zones, SKU).
(Edit)
42 = 32 | 8 | 2 = zone6 | zone4 | zone2 -- where | is the bitwise OR operator.
& is the bitwise AND operator. See http://dev.mysql.com/doc/refman/5.6/en/non-typed-operators.html
(zones & 42) = 42 effectively checks that all 3 of those bits are "on".
(zones & 42) = 0 effectively checks that all 3 of those bits are "off".
In both cases, it is ignoring the other bits.
42 could be represented as ((1<<5) | (1<<3) | (1<<1)). Because of precedence rules, I recommend using more parentheses than you might think necessary.
1 << 5 means "shift" 1 left by 5 bits, giving 32.
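For illustration, the same query with the mask spelled out in shift notation (with the extra parentheses recommended above):
-- any of zone2, zone4, zone6 set; ((1<<5) | (1<<3) | (1<<1)) = 42
SELECT SKU
FROM TABLE1
WHERE (zones & ((1<<5) | (1<<3) | (1<<1))) != 0;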

How to get the space consumed by a set of records in a table?

I have a table called HUGETABLE. I want to get the size consumed by a set of records in HUGETABLE under some SELECT ... WHERE criteria. I am currently using the following query to get the size in KB; is there any improved query other than this?
SELECT SUM(LENGTH(IFNULL(HUGETABLE.GUID,0))+
LENGTH(IFNULL(HUGETABLE.ID,0))+
LENGTH(IFNULL(HUGETABLE.SEQUENCENO,0))+
LENGTH(IFNULL(HUGETABLE.BID,0))+
LENGTH(IFNULL(HUGETABLE.TID,0))+
LENGTH(IFNULL(HUGETABLE.TABLENAME,0))+
LENGTH(IFNULL(HUGETABLE.MODIFIEDDATE,0))+
LENGTH(IFNULL(HUGETABLE.MODIFIEDBY,0))+
LENGTH(IFNULL(HUGETABLE.UPDATEXML,0))+
LENGTH(IFNULL(HUGETABLE.BESTIDENTIFIERVALUE,0)))/1024 AS "Total Size in KB"
FROM HUGETABLE
WHERE TID = 'myvalue';
TID is an indexed field.
Instead of making a huge query, you can compute the size from the maximal size of each field. If you have VARCHAR columns it's a bit more difficult, and in that case you have to measure the actual lengths.
According to this documentation page, an INT takes 4 bytes, a DATE 3 bytes, etc. So with 5 INTs (your first 5 fields), 1 DATE and 4 VARCHARs (I'm guessing, not sure about your table structure), you just have to do something like:
SELECT 4*5 + 1*3 + (LENGTH(TABLENAME)+2) + (LENGTH(MODIFIEDBY)+2) + ...
FROM HUGETABLE
WHERE TID = 'myvalue';
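A fuller version of that sketch, summed over the matching rows; the per-type sizes are the same guesses as above (5 INTs, 1 DATE, VARCHARs measured per row, with +2 as an assumed VARCHAR length overhead):
SELECT SUM(
        4 * 5                                   -- GUID, ID, SEQUENCENO, BID, TID as INT
        + 3                                     -- MODIFIEDDATE as DATE
        + LENGTH(IFNULL(TABLENAME, '')) + 2
        + LENGTH(IFNULL(MODIFIEDBY, '')) + 2
        + LENGTH(IFNULL(UPDATEXML, '')) + 2
        + LENGTH(IFNULL(BESTIDENTIFIERVALUE, '')) + 2
    ) / 1024 AS `Estimated size in KB`
FROM HUGETABLE
WHERE TID = 'myvalue';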
If you also want to take into account the space used by the index, it's a bit more difficult, and I'm not informed enough about it.