How to index a wide table of booleans - mysql

My question is on building indexes when your client is using a lot of little fields.
Consider a search of the following:
(can't change it, this is what the client is providing)
SKU zone1 zone2 zone3 zone4 zone5 zone6 zone7 zone8 zone9 zone10 zone11
A123 1 1 1 1 1 1 1 1
B234 1 1 1 1 1 1 1
C345 1 1 1 1 1 1
But it is much wider, and there are many more categories than just Zone.
The user will be looking for skus that match at least one of the selected zones. I intend to query this with (if the user checked "zone2, zone4, zone6")
select SKU from TABLE1 where (1 IN (zone2,zone4,zone6))
Is there any advantage to indexing with a multi tiered index like so:
create index zones on table1 (zone1,zone2,zone3,zone4,zone5,zone6,zone7,zone8,zone9,zone10,zone11)
Or will that only be beneficial when the user checked zone1?
Thanks,
Rob

You should structure the data as:
create table SKUZones (
Sku varchar(255) not null,
zone varchar(255) not null
)
It would be populated with one row for each place where a SKU has a 1. Queries can then take full advantage of an index on SKUZones(zone). A query such as:
select SKU
from SKUZones
where zone in ('zone2', 'zone4', 'zone6');
will readily take advantage of an index. However, if the data is not structured in a way appropriate for a relational database, then it is much harder to make queries efficient.
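A minimal sketch of that normalized layout, using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows, the four-zone width, the index name and the helper skus_in_any_zone are all made up for illustration; only the SKUZones table and the IN query come from the answer above.

```python
import sqlite3

# Hypothetical wide rows: SKU plus zone1..zone4 flags (the real table is wider).
wide_rows = [
    ("A123", 1, 1, 0, 1),
    ("B234", 0, 1, 0, 0),
    ("C345", 0, 0, 1, 0),
]
zone_names = ["zone1", "zone2", "zone3", "zone4"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SKUZones (Sku TEXT NOT NULL, zone TEXT NOT NULL)")
# Index on zone first, so a lookup by zone does not scan the whole table.
conn.execute("CREATE INDEX idx_skuzones_zone ON SKUZones (zone, Sku)")

# One narrow row per (SKU, zone) pair that holds a 1 in the wide table.
pairs = [
    (sku, name)
    for sku, *flags in wide_rows
    for name, flag in zip(zone_names, flags)
    if flag == 1
]
conn.executemany("INSERT INTO SKUZones (Sku, zone) VALUES (?, ?)", pairs)


def skus_in_any_zone(selected):
    """Return SKUs that have a 1 in at least one of the selected zones."""
    placeholders = ", ".join("?" for _ in selected)
    sql = (
        "SELECT DISTINCT Sku FROM SKUZones "
        f"WHERE zone IN ({placeholders}) ORDER BY Sku"
    )
    return [row[0] for row in conn.execute(sql, selected)]


matches = skus_in_any_zone(["zone2", "zone4"])
```

DISTINCT is added here because a SKU that sits in several of the selected zones would otherwise be returned once per matching zone.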
One approach you could take if you can add a column to the table is the following:
Add a new column called zones or something similar.
Use a trigger to populate it with values for each "1" in the columns (so "zone3 zone4 zone5 . . ." for the first row in your data).
Build a full text index on the column.
Run your query using MATCH ... AGAINST.

Indexing boolean values is almost always useless.
What if you use a SET datatype? Or BIGINT UNSIGNED?
Let's talk through how to do it with some sized INT, named zones
zone1 is the bottom bit (1<<0 = 1)
zone2 is the next bit (1<<1 = 2)
zone3 is the next bit (1<<2 = 4)
zone4 is the next bit (1<<3 = 8)
etc.
where (1 IN (zone2,zone4,zone6)) becomes
where (zones & 42) != 0.
To check for all 3 zones being set: where (zones & 42) = 42.
As for indexing, no index will help this design; there will still be a table scan.
If there are 11 zones, then SMALLINT UNSIGNED (2 bytes) will suffice. This will be considerably more compact than other designs, hence possibly faster.
For this query, you can have a "covering" index, which helps some:
select SKU from TABLE1 where (zones & 42) != 0 -- with INDEX(zones, SKU) defined on TABLE1
(Edit)
42 = 32 | 8 | 2 = zone6 | zone4 | zone2 -- where | is the bitwise OR operator.
& is the bitwise AND operator. See http://dev.mysql.com/doc/refman/5.6/en/non-typed-operators.html
(zones & 42) = 42 effectively checks that all 3 of those bits are "on".
(zones & 42) = 0 effectively checks that all 3 of those bits are "off".
In both cases, it is ignoring the other bits.
42 could be represented as ((1<<5) | (1<<3) | (1<<1)). Because of precedence rules, I recommend using more parentheses than you might think necessary.
1 << 5 means "shift" 1 left by 5 bits, which gives 32.
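The bit arithmetic above can be checked with a short Python sketch. The helper names (zone_mask, has_any, has_all) and the sample row are hypothetical; the numbering follows the answer, with zone1 as the bottom bit.

```python
def zone_mask(zone_numbers):
    """Combine 1-based zone numbers into one bitmask (zone1 is bit 0)."""
    mask = 0
    for n in zone_numbers:
        mask |= 1 << (n - 1)
    return mask


def has_any(zones, mask):
    """Mirror of: (zones & mask) != 0"""
    return (zones & mask) != 0


def has_all(zones, mask):
    """Mirror of: (zones & mask) = mask"""
    return (zones & mask) == mask


selected = zone_mask([2, 4, 6])   # 2 | 8 | 32
row_zones = zone_mask([1, 4])     # a hypothetical row with zone1 and zone4 set
```

Here has_any(row_zones, selected) is true because zone4 is shared, while has_all(row_zones, selected) is false because zone2 and zone6 are missing.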

Related

How do I use the BETWEEN operator for text searches in a MySQL database?

I have a SQL table on which I use the BETWEEN operator.
The BETWEEN operator selects values within a range. The values can be numbers, text, or dates.
stu_id | name | city | pin
1 | Raj | Ranchi | 123456
2 | sonu | Delhi | 652345
3 | ANU | KOLKATA | 879845
4 | K.K's Company | Delhi | 345546
5 | J.K's Company | Delhi | 123456
I have a query like this:-
SELECT * FROM student WHERE stu_id BETWEEN 2 AND 4 -- includes 2 and 4
SELECT * FROM `student` WHERE name BETWEEN 'A' AND 'K' -- includes A but not K
My question is: why is K not included?
I want names starting with K in the results as well.
Don't use between -- until you really understand it. That is just general advice. BETWEEN is inclusive, so your second query is equivalent to:
WHERE name >= 'A' AND
name <= 'K'
Because of the equality, 'K' is included in the result set. However, names longer than one character and starting with 'K' are not -- "Ka" for instance.
Instead, be explicit:
WHERE name >= 'A' AND
name < 'L'
Of course, BETWEEN can be useful. However, it is useful for discrete values, such as integers. It is a bit dangerous with numbers with decimals, strings, and date/time values. That is why I encourage you to express the logic as inequalities.
To supplement Gordon's answer, one way to get what you're expecting is to turn your name into a discrete set of values:
SELECT * FROM `student` WHERE LEFT(name, 1) between 'A' and 'K'
You need to appreciate that K.K's Company sorts alphabetically AFTER the letter K on its own, so it is not BETWEEN 'A' and 'K', in the same way that 4.1 is not BETWEEN 2 and 4.
Stripping the name down to its first character makes the query behave as you expect, but take note: you should generally avoid running functions on column values in a WHERE clause. With a million names, that's a million strings MySQL has to reduce to their first letter, and it may no longer be able to use an index on name, which batters performance.
Instead, you could :
SELECT * FROM `student` WHERE name >= 'A' and name < 'L'
which is more likely to permit the use of an index, as you aren't manipulating the stored values before comparing them.
This works because it asks for everything up to but not including L, which includes all of your names starting with K, even kzzzzzzzz. Numerically it is equivalent to saying number >= 2 and number < 5, which gives you all the numbers starting with 2, 3 or 4 (like the 4.1 from before) but not 5.
Remember that BETWEEN is inclusive at both ends. Always revert to a pattern of a >= b and a < c, a >= c and a < d when you want to specify ranges that capture all possible values.
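The difference between the inclusive range and the half-open range can be sketched in Python with the names from the question. This is only an approximation: upper-casing stands in for a case-insensitive MySQL collation, which is an assumption, and the helper names are hypothetical.

```python
names = ["Raj", "sonu", "ANU", "K.K's Company", "J.K's Company"]


def key(s):
    # Assumption: upper-casing approximates a case-insensitive collation.
    return s.upper()


def between_inclusive(value, low, high):
    """Mirror of: value BETWEEN low AND high"""
    return key(low) <= key(value) <= key(high)


def half_open(value, low, high_exclusive):
    """Mirror of: value >= low AND value < high_exclusive"""
    return key(low) <= key(value) < key(high_exclusive)


with_between = [n for n in names if between_inclusive(n, "A", "K")]
with_half_open = [n for n in names if half_open(n, "A", "L")]
```

"K.K's Company" is longer than "K" and shares its first character, so it sorts after "K" and falls outside the inclusive range, but it is still below "L" and so falls inside the half-open one.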
Strings are compared in lexicographical order, so 'K.K's Company' > 'K'.
Casting the name to an integer does not help here, because a non-numeric string casts to 0. Instead, compare only the first character, which you can extract with SUBSTRING. I've updated your query below; it will include the names starting with K as well.
SELECT * FROM student WHERE SUBSTRING(name FROM 1 FOR 1)
BETWEEN 'A' AND 'K';
Hope this helps.

Recursively running a MySQL function

I have a function in MySQL that needs to be run about 50 times (not a set value) in a query. the inputs are currently stored in an array such as
[1,2,3,4,5,6,7,8,9,10]
when executing the MySQL query individually it's working fine, please see below
column_name denotes the column it's getting the data for, in this case, it's a DOUBLE in the database
The second value in the MOD() function is the input I'm supplying MySQL from the aforementioned array
SELECT id, MOD(column_name, 4) AS mod_output
FROM table
HAVING mod_output > 10
To achieve the output I require* the following code works
SELECT id, MOD(column_name, 4) AS mod_output1, MOD(column_name, 5) AS mod_output2, MOD(column_name, 6) AS mod_output3
FROM table
HAVING mod_output1 > 10 AND mod_output2 > 10 AND mod_output3 > 10
However, this is obviously extremely dirty, and with over 50 inputs rather than 3 it will become highly inefficient.
Apart from running over 50 individual queries, is there a better way to achieve the same sort of output (see below)?
In essence, I need to supply MySQL with a list of values and have it run MOD() over all of them on a specified column.
The only data I need returned is the ids of the rows where the output of MOD() for the specified input (the second argument of MOD()) is less than 10.
Please note, MOD() has been used as an example function, however, the final function required *should* be a drop in replacement
example table layout
id | column_name
1 | 0.234977
2 | 0.957739
3 | 2.499387
4 | 48.395777
5 | 9.943782
6 | -39.234894
7 | 23.49859
.....
(The title may be worded wrong, I'm not quite sure how else you'd explain what I'm trying to do here)
Use a join and derived table or temporary table:
SELECT n.n, t.id, MOD(t.column_name, n.n) AS mod_output
FROM table t CROSS JOIN
(SELECT 4 as n UNION ALL SELECT 5 UNION ALL SELECT 6 . . .
) n
WHERE MOD(t.column_name, n.n) > 10;
If you want the results as columns, you can use conditional aggregation afterwards.
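A sketch of the cross join against a derived table of inputs, run through Python's sqlite3 module as a stand-in for MySQL. Several things here are assumptions made for the example: the table is called samples (table is a reserved word), MOD is registered from math.fmod so the SQL reads like the MySQL version, the threshold of 3 is arbitrary, and rows_over_threshold is a hypothetical helper. The data is the sample table from the question.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register MOD for this connection; fmod keeps the sign of the dividend.
conn.create_function("MOD", 2, math.fmod)

conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, column_name REAL)")
conn.executemany(
    "INSERT INTO samples (id, column_name) VALUES (?, ?)",
    [
        (1, 0.234977), (2, 0.957739), (3, 2.499387), (4, 48.395777),
        (5, 9.943782), (6, -39.234894), (7, 23.49859),
    ],
)


def rows_over_threshold(moduli, threshold):
    """Pair every row with every modulus and keep the pairs over the threshold."""
    # Derived table with one row per input value: SELECT ? AS n UNION ALL ...
    derived = " UNION ALL ".join("SELECT ? AS n" for _ in moduli)
    sql = f"""
        SELECT n.n, t.id, MOD(t.column_name, n.n) AS mod_output
        FROM samples t CROSS JOIN ({derived}) n
        WHERE MOD(t.column_name, n.n) > ?
        ORDER BY n.n, t.id
    """
    return conn.execute(sql, [*moduli, threshold]).fetchall()


hits = rows_over_threshold([4, 5, 6], 3)
```

The list of inputs is bound as parameters, so going from 3 values to 50 only changes the list passed in, not the shape of the query.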

How to Find First Valid Row in SQL Based on Difference of Column Values

I am trying to find a reliable query which returns the first instance of an acceptable insert range.
Research:
Some of the links below address similar questions, but I could not get any of them to work for me.
Find first available date, given a date range in SQL
Find closest date in SQL Server
MySQL difference between two rows of a SELECT Statement
How to find a gap in range in SQL
and more...
Objective Query Function:
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
Where InsertRange(1) is the value the query should return. In other words, this would be the first instance where the above condition is satisfied.
Table Structure:
Primary Key: StartRange
StartRange(i-1) < StartRange(i)
StartRange(i-1) + EndRange(i-1) < StartRange(i)
Example Dataset
Below is an example User table (3 columns), with a set range distribution. StartRanges are always ordered in a strictly ascending way, UserID are arbitrary strings, only the sequences of StartRange and EndRange matters:
StartRange EndRange UserID
312 6896 user0
7134 16268 user1
16877 22451 user2
23137 25142 user3
25955 28272 user4
28313 35172 user5
35593 38007 user6
38319 38495 user7
38565 45200 user8
46136 48007 user9
My current Query
I am trying to use this query at the moment:
SELECT t2.StartRange, t2.EndRange
FROM user AS t1, user AS t2
WHERE (t1.StartRange - t2.StartRange+1) > NewValue
ORDER BY t1.EndRange
LIMIT 1
Example Case
Given the table, if NewValue = 800, then the returned answer should be 23137. This means, the first available slot would be between user3 and user4 (with an actual slot size = 813):
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
InsertRange = (StartRange(6) - EndRange(5)) > NewValue
23137 = 25955 - 25142 > 800
More Comments
My query above seemed to work in the special case where StartRanges were tightly packed (i.e. StartRange(i) = StartRange(i-1) + EndRange(i-1) + 1). It no longer works with a less tightly packed set of StartRanges.
Keep in mind that SQL tables have no implicit row order. It seems fair to order your table by StartRange value, though.
We can start to solve this by writing a query to obtain each row paired with the row preceding it. In MySQL before version 8.0, it's hard to do this beautifully because it lacks a row numbering function (ROW_NUMBER() arrived with window functions in 8.0).
This works (http://sqlfiddle.com/#!9/4437c0/7/0). It may have nasty performance because it generates O(n^2) intermediate rows. There's no row for user0; it can't be paired with any preceding row because there is none.
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
Then, you can use that as a subquery, and apply your conditions, which are
gap >= 800
first matching row (lowest StartRange value) ORDER BY SB
just one LIMIT 1
Here's the query (http://sqlfiddle.com/#!9/4437c0/11/0)
SELECT SB-EA Gap,
EA+1 Beginning_of_gap, SB-1 Ending_of_gap,
UserId UserID_after_gap
FROM (
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
) pairs
WHERE SB-EA >= 800
ORDER BY SB
LIMIT 1
Notice that you may actually want the smallest matching gap instead of the first matching gap. That's called best fit, rather than first fit. To get that you use ORDER BY SB-EA instead.
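The first-fit query can be tried against the example dataset with Python's sqlite3 module, used here as a stand-in for MySQL. The inner query is trimmed to the columns the outer query needs, and first_fit is a hypothetical helper that binds NewValue as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (StartRange INTEGER PRIMARY KEY, EndRange INTEGER, UserID TEXT)"
)
conn.executemany(
    "INSERT INTO user (StartRange, EndRange, UserID) VALUES (?, ?, ?)",
    [
        (312, 6896, "user0"), (7134, 16268, "user1"), (16877, 22451, "user2"),
        (23137, 25142, "user3"), (25955, 28272, "user4"), (28313, 35172, "user5"),
        (35593, 38007, "user6"), (38319, 38495, "user7"), (38565, 45200, "user8"),
        (46136, 48007, "user9"),
    ],
)

# Each row b is paired with the largest EndRange that precedes its StartRange.
FIRST_FIT = """
    SELECT SB - EA AS Gap,
           EA + 1 AS Beginning_of_gap,
           SB - 1 AS Ending_of_gap,
           UserID AS UserID_after_gap
    FROM (
        SELECT MAX(a.EndRange) AS EA, b.StartRange AS SB, b.UserID
        FROM user a
        JOIN user b ON a.EndRange <= b.StartRange
        GROUP BY b.StartRange, b.UserID
    ) pairs
    WHERE SB - EA >= ?
    ORDER BY SB
    LIMIT 1
"""


def first_fit(new_value):
    """Return the first gap of at least new_value, or None if there is none."""
    return conn.execute(FIRST_FIT, (new_value,)).fetchone()


slot = first_fit(800)
```

With NewValue = 800 this picks the 813-wide gap that ends just before user4's StartRange of 25955, matching the slot size given in the question.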
Edit: There is another way to use MySQL to join adjacent rows, that doesn't have the O(n^2) performance issue. It involves employing user variables to simulate a row_number() function. The query involved is a hairball (that's a technical term). It's described in the third alternative of the answer to this question. How do I pair rows together in MYSQL?

Oracle SQL when querying a range of data

I have a table that for an ID, will have data in several bucket fields. I want a function to pull out a sum of buckets, but the function parameters will include the start and end bucket field.
So, if I had a table like this:
ID Bucket0 Bucket30 Bucket60 Bucket90 Bucket120
10 5.00 12.00 10.00 0.0 8.00
If I send in the ID and the parameters Bucket0, Bucket0, it would return only the value in the Bucket0 field: 5.00
If I send in the ID and the parameters Bucket30, Bucket120, it would return the sum of the buckets from 30 to 120, or (12+10+0+8) 30.00.
Is there a nicer way to write this other than a huge ugly
if parameter1=bucket0 and parameter2=bucket0
then select bucket0
else if parameter1=bucket0 and parameter2=bucket1
then select bucket0 + bucket1
else if parameter1=bucket0 and parameter2=bucket2
then select bucket0 + bucket1 + bucket2
and so on?
The table already exists, so I don't have a lot of control over that. I can make my parameters for the function however I want. I can safely say that if a set of buckets are wanted, none in the middle will be skipped, so specifying start and end buckets would work. I could have a single comma delimited string of all buckets wanted.
It would have been better if your table had been normalised, like this:
id | bucket | value
---+-----------+------
10 | bucket000 | 5
10 | bucket030 | 12
10 | bucket060 | 10
10 | bucket090 | 0
10 | bucket120 | 8
Also, the buckets would ideally have names that are easy to compare in ranges, so that bucket030 comes between bucket000 and bucket120 in normal alphabetical order, which is not the case if you leave out the zero padding.
If the above normalisation is not possible, then use an unpivot clause to turn your current table into the structure depicted above:
select id, sum(value)
from (
select *
from mytable
unpivot (value for bucket_id in (bucket0 as 'bucket000',
bucket30 as 'bucket030',
bucket60 as 'bucket060',
bucket90 as 'bucket090',
bucket120 as 'bucket120'))
) normalised
where bucket_id between 'bucket000' and 'bucket060'
group by id
When you do this with parameter variables, make sure those parameters have the padded zeroes as well.
You could for instance ensure that as follows for parameter1:
if parameter1 like 'bucket%' then
parameter1 := 'bucket' || lpad(+substr(parameter1, 7), 3, '0');
end if;
...etc.
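The idea behind the padding and the range filter can be sketched outside the database in Python. This illustrates the logic only and is not a replacement for the Oracle query: the row dictionary mirrors the sample row for ID 10, and padded and sum_buckets are hypothetical helpers.

```python
import re

# The sample row for ID 10, keyed by the original column names.
row = {
    "Bucket0": 5.00,
    "Bucket30": 12.00,
    "Bucket60": 10.00,
    "Bucket90": 0.0,
    "Bucket120": 8.00,
}


def padded(name):
    """'Bucket30' -> 'bucket030', so names sort in numeric order."""
    match = re.fullmatch(r"bucket(\d+)", name.lower())
    if match is None:
        raise ValueError(f"not a bucket name: {name!r}")
    return "bucket" + match.group(1).zfill(3)


def sum_buckets(row, first, last):
    """Sum every bucket whose padded name lies between first and last, inclusive."""
    low, high = padded(first), padded(last)
    return sum(value for name, value in row.items() if low <= padded(name) <= high)


total = sum_buckets(row, "Bucket30", "Bucket120")
```

Without the padding, "bucket120" would sort before "bucket30" as a string, and the range comparison would silently skip buckets.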

Mysql index on integer bits

I have a MySQL table like:
id (UNSIGNED INT) PrimaryKey AutoIncrement
name (VARCHAR(10))
status (UNSIGNED INT) Indexed
I use the status column to represent 32 different statuses like:
0 -> open
1 -> deleted
...
31 -> something
This is convenient since I do not know how many statuses I will have. (We now support 32 statuses; we can use a BIGINT to support 64. Needing more than 64 is highly unlikely, but we will see :) )
The problem with this approach is that there is no index at the
bit level -> queries selecting where a bit is set are slow.
I can improve things a bit using range queries -> where status between n1 and n2.
Still, this is not a good approach.
I want to point out that I only want to search on whether a few of the 32 bits are set (let's say bits 0, 12, 13, 21, 31).
Any ideas to improve performance?
If for some reason you cannot normalize your data as RandomSeed suggests in another answer, I'm pretty sure you can just put an index on the field and search using int values (that is, sums of powers of two, 2^n).
For example, if you need exactly bits 0, 12 and 13 set, search where status = 2^0 + 2^12 + 2^13 (that is, 1 + 4096 + 8192 = 12289; ^ here means exponentiation, whereas in MySQL ^ is bitwise XOR).
Edit: If you need to search where those bits are set, regardless of other bits, you could try using bitwise operators, e.g. for bits 0, 12 and 13, search where status & 1 = 1 and status & 4096 = 4096 and status & 8192 = 8192, or equivalently where (status & 12289) = 12289.
However compared to a ranged query I'm not sure what will be the performance improvement (if any). So as said before, normalization might be the only solution.
Normalize your data.
MainEntity:
id (UNSIGNED INT) PrimaryKey AutoIncrement
name (VARCHAR(10))
Status:
id (UNSIGNED INT) PrimaryKey AutoIncrement
label (VARCHAR(10))
EntityHasStatus:
entity_id (UNSIGNED INT) PrimaryKey
status_id (UNSIGNED INT) PrimaryKey
Entities having both statuses 1 and 5:
SELECT MainEntity.*
FROM MainEntity
JOIN EntityHasStatus AS Status1
ON Status1.entity_id = MainEntity.id
AND Status1.status_id = 1
JOIN EntityHasStatus AS Status5
ON Status5.entity_id = MainEntity.id
AND Status5.status_id = 5
Entities having either status 4 or 6:
SELECT MainEntity.*
FROM MainEntity
LEFT JOIN EntityHasStatus AS Status4
ON Status4.entity_id = MainEntity.id
AND Status4.status_id = 4
LEFT JOIN EntityHasStatus AS Status6
ON Status6.entity_id = MainEntity.id
AND Status6.status_id = 6
WHERE
Status4.status_id IS NOT NULL
OR Status6.status_id IS NOT NULL
These queries should be virtually instant (prefer the first form when possible, as it is a tad bit more efficient).
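A small end-to-end sketch of the normalized design, using Python's sqlite3 module as a stand-in for MySQL. The sample entities and status assignments are invented for illustration, the Status lookup table is left out for brevity, and every column reference in the joins is qualified with its table alias so that entity_id is never ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MainEntity (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE EntityHasStatus (
        entity_id INTEGER NOT NULL,
        status_id INTEGER NOT NULL,
        PRIMARY KEY (entity_id, status_id)
    );
    -- Hypothetical data: entity 1 has statuses 1 and 5, entity 3 has 4 and 6.
    INSERT INTO MainEntity VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO EntityHasStatus VALUES (1, 1), (1, 5), (2, 1), (3, 4), (3, 6), (4, 6);
""")

BOTH_1_AND_5 = """
    SELECT MainEntity.id
    FROM MainEntity
    JOIN EntityHasStatus AS Status1
      ON Status1.entity_id = MainEntity.id AND Status1.status_id = 1
    JOIN EntityHasStatus AS Status5
      ON Status5.entity_id = MainEntity.id AND Status5.status_id = 5
    ORDER BY MainEntity.id
"""

EITHER_4_OR_6 = """
    SELECT MainEntity.id
    FROM MainEntity
    LEFT JOIN EntityHasStatus AS Status4
      ON Status4.entity_id = MainEntity.id AND Status4.status_id = 4
    LEFT JOIN EntityHasStatus AS Status6
      ON Status6.entity_id = MainEntity.id AND Status6.status_id = 6
    WHERE Status4.status_id IS NOT NULL
       OR Status6.status_id IS NOT NULL
    ORDER BY MainEntity.id
"""

both = [row[0] for row in conn.execute(BOTH_1_AND_5)]
either = [row[0] for row in conn.execute(EITHER_4_OR_6)]
```

Because (entity_id, status_id) is the primary key, each join matches at most one row per entity, so an entity holding both status 4 and status 6 is still returned only once.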