MySQL: Joins vs. Bitwise operator, and performance thereof - mysql

There are a number of questions about this subject, but mine is more specific to performance concerns.
With regards to an object, I want to track a multitude of 'attributes', each with a multitude of discrete 'values' (each attribute has between 3 and 16 valid 'values'). For instance, consider tracking military personnel. The attributes/values might be (not real, I totally made these up):
attribute: {values}
languages_spoken: {english, spanish, russian, chinese, …. }
certificates: {infantry, airborne, pilot, tank_driver…..}
approved_equipment: {m4, rocket_launcher, shovel, super_secret_radio_thingy….}
approved_operations: {reconnaissance, logistics, invasion, cooking, ….}
awards_won: {medal_honor, purple_heart, ….}
… and so on.
One way to do this - the way I want to do this - is to have a personnel table and an attributes table:
personnel table => [id, name, rank, address…..]
personnel_attributes table => [personnel_id, attribute_id, value_id]
along with the associated attributes and values tables.
So if personnel_id=31415 is approved for logistics, there would be the following entry in the personnel_attributes table:
personnel_id | attribute_id | value_id
31415 | 3 | 2
where 3 = attribute_id for "approved_operations" and 2 = value_id for "logistics" (sorry formatting spaces didn't line up.)
Then a search to find all personnel who speak english OR spanish, AND who are infantry OR airborne, AND can operate a shovel OR super_secret_radio_thingy would be something like:
SELECT t1.personnel_id
FROM personnel_attributes t1, personnel_attributes t2, personnel_attributes t3
WHERE ((t1.attribute_id = 1 and t1.value_id = 1) OR (t1.attribute_id = 1 and t1.value_id = 2))
AND ((t2.attribute_id = 2 and t2.value_id = 1) OR (t2.attribute_id = 2 and t2.value_id = 2))
AND ((t3.attribute_id = 3 and t3.value_id = 3) OR (t3.attribute_id = 3 and t3.value_id = 4))
AND t2.personnel_id = t1.personnel_id
AND t3.personnel_id = t1.personnel_id;
Assuming this isn't a totally stupid way to write the SQL query, the problem is that it's very slow (even with seemingly relevant indexes).
So I am toying with using bitwise operators instead, where each attribute is a column in a table and each value is a bit. The same search would be:
SELECT personnel_id FROM personnel_attributes
WHERE language & b'00000011'
AND certificates & b'00000011'
AND approved_operations & b'00001100';
I know this does a full table scan, but in my experiments with 350,000 sample personnel, and 16 attributes each, the first method took 20 seconds whereas the bitwise method took 38 milliseconds!
Am I doing something wrong here? Are these the performance results I should expect?
Thanks!

Using the bitwise operation will require evaluating all of the rows. I believe your problem can be solved with a change to your original SELECT statement and how you're joining your tables.
To make it a little easier to read, I've changed the attribute values to words instead of integers so the example is less confusing, but obviously you can leave them as integers and the concept would still work:
CREATE TABLE PERSONNEL (
ID INT,
NAME VARCHAR(20)
);
CREATE TABLE PERSONNEL_ATTRIBUTES (
PERSONNEL_ID INT,
ATTRIB_ID INT,
ATTRIB_VALUE VARCHAR(20)
);
INSERT INTO PERSONNEL VALUES (1, 'JIM SMITH');
INSERT INTO PERSONNEL VALUES (2, 'JANE DOE');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (1, 1, 'English');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (1, 1, 'Spanish');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (1, 1, 'Russian');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (1, 3, 'Logistics');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (1, 3, 'Infantry');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (2, 1, 'English');
INSERT INTO PERSONNEL_ATTRIBUTES VALUES (2, 3, 'Infantry');
SELECT P.ID, P.NAME, PA1.ATTRIB_VALUE AS DESIRED_LANGUAGE, PA2.ATTRIB_VALUE AS APPROVED_OPERATION
FROM PERSONNEL P
JOIN PERSONNEL_ATTRIBUTES PA1 ON P.ID = PA1.PERSONNEL_ID AND PA1.ATTRIB_ID = 1
JOIN PERSONNEL_ATTRIBUTES PA2 ON P.ID = PA2.PERSONNEL_ID AND PA2.ATTRIB_ID = 3
WHERE PA1.ATTRIB_VALUE = 'Spanish' AND (PA2.ATTRIB_VALUE = 'Infantry' OR PA2.ATTRIB_VALUE = 'Airborne');
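Before timing the two designs against each other, it may be worth confirming that they return the same people. Here is a minimal sketch of that check. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, purely so it runs anywhere), a handful of made-up rows, and plain integer masks instead of MySQL's b'...' literals. The personnel_bits table and its column names are illustrative, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Row-per-value design from the question
    CREATE TABLE personnel_attributes (
        personnel_id INTEGER, attribute_id INTEGER, value_id INTEGER,
        PRIMARY KEY (personnel_id, attribute_id, value_id)
    );
    -- Bitmask design: one column per attribute, bit (value_id - 1) set per value
    CREATE TABLE personnel_bits (
        personnel_id INTEGER PRIMARY KEY,
        language INTEGER, certificates INTEGER, approved_equipment INTEGER
    );
""")

# Made-up sample rows: (personnel_id, attribute_id, value_id)
rows = [
    (1, 1, 1), (1, 2, 2), (1, 3, 3),             # matches all three conditions
    (2, 1, 2), (2, 2, 1),                        # has no equipment at all
    (3, 1, 3), (3, 2, 1), (3, 3, 4),             # language is neither 1 nor 2
    (4, 1, 1), (4, 1, 2), (4, 2, 1), (4, 3, 4),  # matches, speaks two languages
]
conn.executemany("INSERT INTO personnel_attributes VALUES (?, ?, ?)", rows)

# Derive the bitmask table from the same rows so both designs hold identical data
masks = {}
for pid, attr, val in rows:
    masks.setdefault(pid, {1: 0, 2: 0, 3: 0})[attr] |= 1 << (val - 1)
conn.executemany(
    "INSERT INTO personnel_bits VALUES (?, ?, ?, ?)",
    [(pid, m[1], m[2], m[3]) for pid, m in masks.items()],
)

join_query = """
    SELECT DISTINCT t1.personnel_id
    FROM personnel_attributes t1
    JOIN personnel_attributes t2 ON t2.personnel_id = t1.personnel_id
    JOIN personnel_attributes t3 ON t3.personnel_id = t1.personnel_id
    WHERE t1.attribute_id = 1 AND t1.value_id IN (1, 2)
      AND t2.attribute_id = 2 AND t2.value_id IN (1, 2)
      AND t3.attribute_id = 3 AND t3.value_id IN (3, 4)
    ORDER BY t1.personnel_id
"""
bit_query = """
    SELECT personnel_id
    FROM personnel_bits
    WHERE language & 3 AND certificates & 3 AND approved_equipment & 12
    ORDER BY personnel_id
"""

join_ids = [r[0] for r in conn.execute(join_query)]
bit_ids = [r[0] for r in conn.execute(bit_query)]
print(join_ids, bit_ids)  # both lists should contain the same ids
```

Note the DISTINCT in the join version: a person who has two of the requested values for one attribute (person 4 here speaks both languages) would otherwise be returned twice, which the bitmask version never does.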

I have the same dilemma: using django-bitfield or a separate table for flags.
Inspired by your experiment, I used a 3.5m record table (InnoDB) and ran count() and retrieval queries for both variants. The result was astonishing: approx. 5 sec vs. 40 sec, and the bitfield wins.

Related

SSRS - Lookup only on certain columns in a matrix

I have a matrix table with a column group "Application questions" let's say these are in table 1. Some of the questions have unique string values such as: Name, ID number, email address. But others have an integer value that relates to an actual value for a separate lookup table (table 2), for example, the values for the column "Gender" are 1, 2, 3, for Male, Female, Other. Is there a way in the lookup function that I can isolate the columns that only have integer values or alternatively ignore the other columns with unique string values?
Table1
NAME  | ATTRIBUTE_id | ATTRIBUTE
-----------------------------------------
James | 5            | 1
James | 6            | james@email.com
James | 7            | 8
Table2
Lookup_id | ATTRIBUTE_id | Description
-----------------------------------------
1         | 5            | Male
2         | 5            | Female
3         | 5            | Other
8         | 7            | New York
9         | 7            | Los Angeles
Output
NAME  | Email           | Gender | City
-------------------------------------------------------
James | james@email.com | Male   | New York
Hope that makes sense!
Thank you.
I think this will be easier to do in your dataset query.
Below I have recreated your sample data and added an extra person in to make sure it's working as expected.
DECLARE @t TABLE (Name varchar(10), AttributeID INT, AttributeMemberID varchar(50))
INSERT INTO @t VALUES
('Mary', 5, '2'),
('Mary', 6, 'Mary@email.com'),
('James', 5, '1'),
('James', 6, 'james@email.com'),
('James', 7, '8')
DECLARE @AttributeMembers TABLE (AttributeMemberID INT, AttributeID int, Description varchar(20))
INSERT INTO @AttributeMembers VALUES
(1, 5, 'Male'),
(2, 5, 'Female'),
(3, 5, 'Other'),
(8, 7, 'New York'),
(9, 7, 'Los Angeles')
I also added in a new table which describes what each attribute is. We will use the output from this as column headers in the final SSRS matrix.
DECLARE @Attributes TABLE(AttributeID int, Caption varchar(50))
INSERT INTO @Attributes VALUES
(5, 'Gender'),
(6, 'Email'),
(7, 'City')
Finally we join all three together and get a fairly normalised view of the data. The join is a bit messy as your current tables use the same column for both integer-based lookups/joins and absolute string values. Hence the CASE in the JOIN.
SELECT
t.Name,
a.Caption,
ISNULL(am.[Description], t.AttributeMemberID) as Label
FROM @t t
JOIN @Attributes a on t.AttributeID = a.AttributeID
LEFT JOIN @AttributeMembers am
on t.AttributeID = am.AttributeID
and
CAST(CASE WHEN ISNUMERIC(t.AttributeMemberID) = 0 THEN 0 ELSE t.AttributeMemberID END as int)
= am.AttributeMemberID
ORDER BY Name, Caption, Label
This gives us the following output...
As you can see, this will be easy to put into a Matrix control in SSRS.
Row group by Name, column group by Caption, and the data cell would be Label.
If you wanted to ensure the order of the columns, you could extend the Attributes table to include a SortOrder column, include this in the query output and use this in SSRS to order the columns by.
Hope that's clear enough.
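For anyone who wants to try the lookup-with-fallback idea outside SQL Server, here is a rough sketch using Python's built-in sqlite3 module. This is a swapped-in variant, not the T-SQL above: SQLite has no ISNUMERIC, so instead of casting the attribute value to int, the sketch casts the lookup key to text and compares strings. The plain table names (t, AttributeMembers, Attributes) stand in for the table variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (Name TEXT, AttributeID INTEGER, AttributeMemberID TEXT);
    INSERT INTO t VALUES
        ('Mary', 5, '2'), ('Mary', 6, 'Mary@email.com'),
        ('James', 5, '1'), ('James', 6, 'james@email.com'), ('James', 7, '8');

    CREATE TABLE AttributeMembers (
        AttributeMemberID INTEGER, AttributeID INTEGER, Description TEXT
    );
    INSERT INTO AttributeMembers VALUES
        (1, 5, 'Male'), (2, 5, 'Female'), (3, 5, 'Other'),
        (8, 7, 'New York'), (9, 7, 'Los Angeles');

    CREATE TABLE Attributes (AttributeID INTEGER, Caption TEXT);
    INSERT INTO Attributes VALUES (5, 'Gender'), (6, 'Email'), (7, 'City');
""")

# Use the lookup description when one exists, otherwise fall back to the raw value
query = """
    SELECT t.Name, a.Caption,
           COALESCE(am.Description, t.AttributeMemberID) AS Label
    FROM t
    JOIN Attributes a ON a.AttributeID = t.AttributeID
    LEFT JOIN AttributeMembers am
           ON am.AttributeID = t.AttributeID
          AND CAST(am.AttributeMemberID AS TEXT) = t.AttributeMemberID
    ORDER BY t.Name, a.Caption
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each output row is a (Name, Caption, Label) triple, which is the shape a matrix with Name as row group and Caption as column group expects.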

SQL Query for exact match in many to many relation

I have the following tables(only listing the required attributes)
medicine (id, name),
generic (id, name),
med_gen (med_id references medicine(id),gen_id references generic(id), potency)
Sample Data
medicine
(1, 'Crocin')
(2, 'Stamlo')
(3, 'NT Kuf')
generic
(1, 'Hexachlorodine')
(2, 'Methyl Benzoate')
med_gen
(1, 1, '100mg')
(1, 2, '50ml')
(2, 1, '100mg')
(2, 2, '60ml')
(3, 1, '100mg')
(3, 2, '50ml')
I want all the medicines which are equivalent to a given medicine. Two medicines are equivalent if they have the same generics as well as the same potency for each generic. In the above sample data, all three have the same generics, but only 1 and 3 also have the same potency for the corresponding generics. So 1 and 3 are equivalent medicines.
I want to find out equivalent medicines given a medicine id.
NOTE : One medicine may have any number of generics. Medicine table has around 102000 records, generic table around 2200 and potency table around 200000 records. So performance is a key point.
NOTE 2 : The database used is MySQL.
One way to do it in MySQL is to leverage GROUP_CONCAT() function
SELECT g.med_id
FROM
(
SELECT med_id, GROUP_CONCAT(gen_id ORDER BY gen_id) gen_id, GROUP_CONCAT(potency ORDER BY gen_id) potency -- potency ordered by gen_id so each potency stays paired with its generic
FROM med_gen
WHERE med_id = 1 -- here 1 is med_id for which you're trying to find analogs
) o JOIN
(
SELECT med_id, GROUP_CONCAT(gen_id ORDER BY gen_id) gen_id, GROUP_CONCAT(potency ORDER BY gen_id) potency
FROM med_gen
WHERE med_id <> 1 -- here 1 is med_id for which you're trying to find analogs
GROUP BY med_id
) g
ON o.gen_id = g.gen_id
AND o.potency = g.potency;
Output:
| MED_ID |
|--------|
| 3 |
Here is SQLFiddle demo
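If string concatenation feels fragile, the same exact-match test can also be expressed by counting rows. The following is only a sketch of that alternative, not the GROUP_CONCAT approach above. It runs on Python's built-in sqlite3 module as a stand-in for MySQL, and it assumes (med_id, gen_id) is unique in med_gen. The helper name equivalent_medicines is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE med_gen (
        med_id INTEGER, gen_id INTEGER, potency TEXT,
        PRIMARY KEY (med_id, gen_id)
    );
    INSERT INTO med_gen VALUES
        (1, 1, '100mg'), (1, 2, '50ml'),
        (2, 1, '100mg'), (2, 2, '60ml'),
        (3, 1, '100mg'), (3, 2, '50ml');
""")


def equivalent_medicines(conn, target_id):
    """Return ids of medicines whose (generic, potency) set equals the target's."""
    query = """
        SELECT b.med_id
        FROM med_gen b
        LEFT JOIN med_gen a
               ON a.med_id = :target
              AND a.gen_id = b.gen_id
              AND a.potency = b.potency
        WHERE b.med_id <> :target
        GROUP BY b.med_id
        HAVING COUNT(*) = COUNT(a.med_id)          -- every row of b matched the target
           AND COUNT(*) = (SELECT COUNT(*) FROM med_gen
                           WHERE med_id = :target) -- and b has as many rows as the target
        ORDER BY b.med_id
    """
    return [row[0] for row in conn.execute(query, {"target": target_id})]


print(equivalent_medicines(conn, 1))
```

The two HAVING conditions together say "no extra generics and no missing generics", which is what exact match means here. Whether this beats GROUP_CONCAT on 200,000 rows would need measuring on the real data.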

mysql distribution of combinations/values

I have a mysql table which contains some random combination of numbers. For simplicity take the following table as example:
index|n1|n2|n3
1 1 2 3
2 4 10 32
3 3 10 4
4 35 1 2
5 27 1 3
etc
What I want to find out is the number of times a combination has occurred in the table. For instance, how many times has the combination of 4 10 or 1 2 or 1 2 3 or 3 10 4 etc. occurred?
Do I have to create another table that contains all possible combinations and do comparison from there or is there another way to do this?
For a single combination, this is easy:
SELECT COUNT(*)
FROM my_table
WHERE n1 = 3 AND n2 = 10 AND n3 = 4
If you want to do this with multiple combinations, you could create a (temporary) table of them and join that table with your data, something like this:
CREATE TEMPORARY TABLE combinations (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
n1 INTEGER, n2 INTEGER, n3 INTEGER
);
INSERT INTO combinations (n1, n2, n3) VALUES
(1, 2, NULL), (4, 10, NULL), (1, 2, 3), (3, 10, 4);
SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON (c.n1 = t.n1 OR c.n1 IS NULL)
AND (c.n2 = t.n2 OR c.n2 IS NULL)
AND (c.n3 = t.n3 OR c.n3 IS NULL)
GROUP BY c.id;
(demo on SQLize)
Note that this query as written is not very efficient due to the OR c.n? IS NULL clauses, which MySQL isn't smart enough to optimize. If all your combinations contain the same number of terms, you can leave those out, which will allow the query to make use of indexes on the data table.
Ps. With the query above, the combination (1, 2, NULL) won't match (35, 1, 2). However, (NULL, 1, 2) will, so, if you want both, a simple workaround would be to just include both patterns in your table of combinations.
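To see how the NULL-as-wildcard join behaves, here is a small sketch that runs the same query on Python's built-in sqlite3 module (a stand-in for MySQL, chosen only so the example is runnable as-is). The data is the sample from the question plus one made-up row (6, 1, 2, 7), and the pattern list adds (NULL, 1, 2) and a pattern that never occurs. The sketch also assumes the data table's key column is called id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY, n1 INTEGER, n2 INTEGER, n3 INTEGER
    );
    INSERT INTO my_table VALUES
        (1, 1, 2, 3), (2, 4, 10, 32), (3, 3, 10, 4),
        (4, 35, 1, 2), (5, 27, 1, 3),
        (6, 1, 2, 7);  -- extra made-up row so one pattern matches twice

    CREATE TABLE combinations (
        id INTEGER PRIMARY KEY, n1 INTEGER, n2 INTEGER, n3 INTEGER
    );
    -- NULL means "any value in this column"
    INSERT INTO combinations (n1, n2, n3) VALUES
        (1, 2, NULL), (NULL, 1, 2), (4, 10, NULL), (1, 2, 3), (7, 8, 9);
""")

query = """
    SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
    FROM combinations AS c
    LEFT JOIN my_table AS t
           ON (c.n1 = t.n1 OR c.n1 IS NULL)
          AND (c.n2 = t.n2 OR c.n2 IS NULL)
          AND (c.n3 = t.n3 OR c.n3 IS NULL)
    GROUP BY c.id
    ORDER BY c.id
"""
counts = conn.execute(query).fetchall()
for n1, n2, n3, num in counts:
    print((n1, n2, n3), "->", num)
```

Because of the LEFT JOIN and COUNT(t.id), a pattern with no matching rows still shows up with a count of 0 instead of disappearing from the result.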
If you actually have many more columns than shown in your example, and you want to match patterns that occur in any set of consecutive columns, then you really should pack your columns into a string and use a LIKE or REGEXP query. For example, if you concatenate all your data columns into a comma-separated string in a column named data, you could search it like this:
INSERT INTO combinations (pattern) VALUES
('1,2'), ('4,10'), ('1,2,3'), ('3,10,4'), ('7,8,9');
SELECT c.pattern, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON CONCAT(',', t.data, ',') LIKE CONCAT('%,', c.pattern, ',%')
GROUP BY c.id;
(demo on SQLize)
You could make this query somewhat faster by making the prefixes and suffixes added with CONCAT() part of the actual data in the tables, but this is still going to be a fairly inefficient query if you have a lot of data to search, because it cannot make use of indexes. If you need to do this kind of substring searching on large datasets efficiently, you may want to use something better suited for that specific purpose than MySQL.
You only have three columns in the table, so you are looking for combinations of 1, 2, and 3 elements.
For simplicity, I'll start with the following table:
select index, n1 as n from t union all
select index, n2 from t union all
select index, n3 from t union all
select distinct index, -1 from t union all
select distinct index, -2 from t
Let's call this "values". Now, we want to get all triples from this table for a given index. In this case, -1 and -2 represent NULL.
select (case when v1.n < 0 then NULL else v1.n end) as n1,
(case when v2.n < 0 then NULL else v2.n end) as n2,
(case when v3.n < 0 then NULL else v3.n end) as n3,
count(*) as NumOccurrences
from values v1 join
values v2
on v1.n < v2.n and v1.index = v2.index join
values v3
on v2.n < v3.n and v2.index = v3.index
group by n1, n2, n3
This is using the join mechanism to generate the combinations.
This method finds all combinations regardless of ordering (so 1, 2, 3 is the same as 2, 3, 1). Also, this ignores duplicates, so it cannot find (1, 2, 2) if 2 is repeated twice.
SELECT
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10))) AS Combination,
COUNT(CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10)))) AS Occurrences
FROM
MyTable
GROUP BY
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10)))
This creates a single column that represents the combination of the values within the 3 columns by concatenating the values. It will count the occurrences of each.

Creating a frequency table in Access VBA

I have a table where different participants are given multiple boxes of medicines on multiple days. I am trying to create a frequency table showing how many participants received each total number of medicine boxes.
The result I'm looking for is -
2 boxes = 1 (since only Lynda got a total of 2 boxes), 4 boxes = 2 (since Ryan and Rinky both got a total of 4 boxes after adding up the medicine boxes)
Please let me know what approach would be the best in this case.
Thanks for your help.
-Nams
I think you want:
SELECT t.SumOf, Count(t.[PARTICIPANT ID]) AS CountOf
FROM (SELECT Table1.[PARTICIPANT ID], Sum(Table1.MEDICINE_BOX) AS SumOf
FROM Table1
GROUP BY Table1.[PARTICIPANT ID]) AS t
GROUP BY t.SumOf;
Where table1 is the name of your table.
If your table is like this:
medicine_dispense
participantID date amount_boxes
ABC 8/29/12 1
ABC 8/30/12 2
XYZ 8/29/12 1
XYZ 8/30/12 1
then a query like this:
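select
total_boxes, count(participantID)
from
(select participantID, sum(amount_boxes) as total_boxes
from medicine_dispense
group by participantID) as totals
group by total_boxes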
should work
I'll use generic SQL. You can paste SQL into Access queries in SQL view. (You might have to delete the CHECK() constraint.)
create table participant_meds (
participant varchar(10) not null,
distribution_date date not null default current_date,
num_boxes integer not null check (num_boxes > 0),
primary key (participant, distribution_date)
);
insert into participant_meds values ('Ryan', '2012-02-03', 1);
insert into participant_meds values ('Ryan', '2012-06-07', 3);
insert into participant_meds values ('Rinky', '2012-02-28', 4);
insert into participant_meds values ('Lynda', '2012-03-04', 2);
insert into participant_meds values ('Russ', '2012-04-05', 2);
insert into participant_meds values ('Russ', '2012-05-08', 2);
insert into participant_meds values ('Russ', '2012-06-12', 2);
Resulting data, sorted, for copy/paste.
participant distribution_date num_boxes
Lynda 2012-03-04 2
Rinky 2012-02-28 4
Russ 2012-04-05 2
Russ 2012-05-08 2
Russ 2012-06-12 2
Ryan 2012-02-03 1
Ryan 2012-06-07 3
This query gives you the total boxes per participant.
select sum(num_boxes) boxes, participant
from participant_meds
group by participant;
6;"Russ"
2;"Lynda"
4;"Ryan"
4;"Rinky"
Use that query in the FROM clause as if it were a table. (I'd consider storing that query as a view, because I suspect that the total number of boxes per participant might be useful. Also, Access has historically been good at optimizing queries that use views.)
select boxes num_boxes, count(participant) num_participants
from (select sum(num_boxes) boxes, participant
from participant_meds
group by participant) total_boxes
group by num_boxes
order by num_boxes;
num_boxes num_participants
--
2 1
4 2
6 1
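If you want to check the two-step aggregation without opening Access, here is a small sketch that runs it on Python's built-in sqlite3 module (an assumption made only for convenience; the SQL itself is the same idea as above). It first totals boxes per participant, then counts participants per total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE participant_meds ("
    "participant TEXT NOT NULL, distribution_date TEXT NOT NULL, "
    "num_boxes INTEGER NOT NULL CHECK (num_boxes > 0), "
    "PRIMARY KEY (participant, distribution_date))"
)
conn.executemany(
    "INSERT INTO participant_meds VALUES (?, ?, ?)",
    [
        ("Ryan", "2012-02-03", 1),
        ("Ryan", "2012-06-07", 3),
        ("Rinky", "2012-02-28", 4),
        ("Lynda", "2012-03-04", 2),
        ("Russ", "2012-04-05", 2),
        ("Russ", "2012-05-08", 2),
        ("Russ", "2012-06-12", 2),
    ],
)

# Inner query: total boxes per participant.
# Outer query: how many participants share each total.
frequency = conn.execute("""
    SELECT boxes AS num_boxes, COUNT(participant) AS num_participants
    FROM (SELECT participant, SUM(num_boxes) AS boxes
          FROM participant_meds
          GROUP BY participant) AS total_boxes
    GROUP BY boxes
    ORDER BY boxes
""").fetchall()

for num_boxes, num_participants in frequency:
    print(num_boxes, num_participants)
```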

Correct way to store uni/bi/trigrams ngrams in RDBMS?

I have a list of unigrams (single word), bigrams (two words), and trigrams (three words) I have pulled out of a bunch of documents. My goal is a statistical analysis report and also a search I can use on these documents.
John Doe
Xeon 5668x
corporate tax rates
beach
tax plan
Porta San Giovanni
The ngrams are tagged by date and document. So for example, I can find relations between bigrams and when their phrases first appeared as well as relations between documents. I can also search for documents that contain these X number of un/bi/trigram phrases.
So my question is how to store them to optimize these searches.
The simplest approach is just a simple string column for each phrase and then I add relations to the document_ngram table each time I find that word/phrase in the document.
table document
{
id
text
date
}
table ngram
{
id
ngram varchar(200);
}
table document_ngram
{
id
ngram_id
document_id
date
}
However, this means that if I want to search through trigrams for a single word I have to use string searching. For example, let's say I wanted all trigrams with the word "summer" in them.
So what if I instead split the words up so that the only thing stored in ngram was a single word, then added three columns so that all 1, 2, & 3 word chains could fit inside document_ngram?
table document_ngram
{
id
word1_id NOT NULL
word2_id DEFAULT NULL
word3_id DEFAULT NULL
document_id
date
}
Is this the correct way to do it? Are there better ways? I am currently using PostgreSQL and MySQL but I believe this is a generic SQL question.
This is how I would model your data (note that 'the' is referenced twice). You could also add weights to the single words.
DROP SCHEMA IF EXISTS ngram CASCADE;
CREATE SCHEMA ngram;
SET search_path='ngram';
CREATE table word
( word_id INTEGER PRIMARY KEY
, the_word varchar
, constraint word_the_word UNIQUE (the_word)
);
CREATE table ngram
( ngram_id INTEGER PRIMARY KEY
, n INTEGER NOT NULL -- arity
, weight REAL -- payload
);
CREATE TABLE ngram_word
( ngram_id INTEGER NOT NULL REFERENCES ngram(ngram_id)
, seq INTEGER NOT NULL
, word_id INTEGER NOT NULL REFERENCES word(word_id)
, PRIMARY KEY (ngram_id,seq)
);
INSERT INTO word(word_id,the_word) VALUES
(1, 'the') ,(2, 'man') ,(3, 'who') ,(4, 'sold') ,(5, 'world' );
INSERT INTO ngram(ngram_id, n, weight) VALUES
(101, 6, 1.0);
INSERT INTO ngram_word(ngram_id,seq,word_id) VALUES
( 101, 1, 1)
, ( 101, 2, 2)
, ( 101, 3, 3)
, ( 101, 4, 4)
, ( 101, 5, 1)
, ( 101, 6, 5)
;
SELECT w.*
FROM ngram_word nw
JOIN word w ON w.word_id = nw.word_id
WHERE ngram_id = 101
ORDER BY seq;
RESULT:
word_id | the_word
---------+----------
1 | the
2 | man
3 | who
4 | sold
1 | the
5 | world
(6 rows)
Now, suppose you want to add a 4-gram to the existing (6-gram) data:
INSERT INTO word(word_id,the_word) VALUES
(6, 'is') ,(7, 'lost') ;
INSERT INTO ngram(ngram_id, n, weight) VALUES
(102, 4, 0.1);
INSERT INTO ngram_word(ngram_id,seq,word_id) VALUES
( 102, 1, 1)
, ( 102, 2, 2)
, ( 102, 3, 6)
, ( 102, 4, 7)
;
SELECT w.*
FROM ngram_word nw
JOIN word w ON w.word_id = nw.word_id
WHERE ngram_id = 102
ORDER BY seq;
Additional result:
INSERT 0 2
INSERT 0 1
INSERT 0 4
word_id | the_word
---------+----------
1 | the
2 | man
6 | is
7 | lost
(4 rows)
BTW: adding a document-type object to this model will add two additional tables to this model: one for the document, and one for document*ngram. (or in another approach: for document*word) A recursive model would also be a possibility.
UPDATE: the above model will need an additional constraint, which will need triggers (or a rule plus an additional table) to be implemented. Pseudocode:
ngram_word.seq >0 AND ngram_word.seq <= (select ngram.n FROM ngram ng WHERE ng.ngram_id = ngram_word.ngram_id)
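As a rough illustration of how such a trigger might look, here is a sketch using Python's built-in sqlite3 module. It is SQLite trigger syntax, not PostgreSQL (where you would write a trigger function in PL/pgSQL instead), and the trigger name ngram_word_seq_range is made up. It only guards INSERTs; a real implementation would also need to cover UPDATEs of ngram_word.seq and of ngram.n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ngram (
        ngram_id INTEGER PRIMARY KEY, n INTEGER NOT NULL, weight REAL
    );
    CREATE TABLE ngram_word (
        ngram_id INTEGER NOT NULL REFERENCES ngram(ngram_id),
        seq INTEGER NOT NULL,
        word_id INTEGER NOT NULL,
        PRIMARY KEY (ngram_id, seq)
    );
    -- Reject rows whose seq falls outside 1..n for the parent ngram
    CREATE TRIGGER ngram_word_seq_range
    BEFORE INSERT ON ngram_word
    WHEN NEW.seq < 1
      OR NEW.seq > (SELECT n FROM ngram WHERE ngram_id = NEW.ngram_id)
    BEGIN
        SELECT RAISE(ABORT, 'seq out of range for ngram');
    END;
    INSERT INTO ngram VALUES (102, 4, 0.1);
""")

# Accepted: seq 4 is within 1..4 for the 4-gram
conn.execute("INSERT INTO ngram_word VALUES (102, 4, 7)")

# Rejected: seq 5 exceeds n = 4, so the trigger aborts the statement
try:
    conn.execute("INSERT INTO ngram_word VALUES (102, 5, 7)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)
```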
One idea would be to modify your original table layout a bit. Consider the ngram varchar(200) column to only contain 1 word of the ngram, add in a word_no (1, 2, or 3) column, and add in a grouping column, so that, for example, the two records for the two words in a bigram are related (give them the same word_group). (In Oracle, I'd pull the word_group numbers from a Sequence; I think PostgreSQL has something similar.)
table document
{
id
text
date
}
table ngram
{
id
word_group
word_no
ngram varchar(200);
}
table document_ngram
{
id
ngram_id
document_id
date
}
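With that layout, the "all trigrams containing summer" search from the question becomes an equality match on a single word instead of a string search. Below is a minimal sketch using Python's built-in sqlite3 module. The sample phrases are partly made up ("long summer days" is not from the question), and the helper name ngrams_containing and the UNIQUE constraint are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per word; rows sharing a word_group form one n-gram
    CREATE TABLE ngram (
        id INTEGER PRIMARY KEY,
        word_group INTEGER NOT NULL,
        word_no INTEGER NOT NULL,
        ngram VARCHAR(200) NOT NULL,
        UNIQUE (word_group, word_no)
    );
    INSERT INTO ngram (word_group, word_no, ngram) VALUES
        (1, 1, 'corporate'), (1, 2, 'tax'), (1, 3, 'rates'),
        (2, 1, 'tax'), (2, 2, 'plan'),
        (3, 1, 'long'), (3, 2, 'summer'), (3, 3, 'days'),
        (4, 1, 'beach');
""")


def ngrams_containing(conn, word, size):
    """Return n-grams of the given size that contain `word`, as lists of words."""
    query = """
        SELECT g.word_group, g.ngram
        FROM ngram g
        WHERE g.word_group IN (SELECT word_group FROM ngram WHERE ngram = :word)
          AND (SELECT COUNT(*) FROM ngram c
               WHERE c.word_group = g.word_group) = :size
        ORDER BY g.word_group, g.word_no
    """
    groups = {}
    for group_id, w in conn.execute(query, {"word": word, "size": size}):
        groups.setdefault(group_id, []).append(w)
    return list(groups.values())


print(ngrams_containing(conn, "summer", 3))
print(ngrams_containing(conn, "tax", 2))
```

Since the word is compared with =, an ordinary index on the ngram column could serve the lookup, which a LIKE '%summer%' search on whole phrases could not use.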