Why do we use UNION to join rows in SQL? - mysql

Why use UNION in SQL at all, when the SELECT statements must have the same order of columns and column names? Couldn't we just UPDATE or ALTER the table to add more rows?

The reason that you would not just keep adding rows to a single table is that they don't belong in that table.
For the same reason that when you do arithmetic between two integer variables, like x + y, you don't permanently add the value of y to x. You want to preserve x's value as its own thing, even though sometimes you also need the sum.
A book like SQL and Relational Theory: How to Write Accurate SQL Code makes clear that there's a difference between a relation and a relvar. A relvar is like a table. It's a persistent storage of a specific set of rows.
A relation is the result of a SQL expression like SELECT or VALUES. That relation may not be stored in any relvar; it is ephemeral. Perhaps it's the result of a more complex query that uses expressions, joins, and so on.
By analogy, a number like 42 is an integer value, while int x, which stores the integer value 42, is an integer variable. Both can be used as an operand for +, but they're not the same kind of thing.
You can UNION two relations if their columns are compatible in number and data type. Those relations aren't necessarily relvars; they could be the results of other subqueries.
Just like in arithmetic, you can add x and a whole other integer expression.
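For illustration, a minimal sketch of a UNION over two relations; the table and column names here are hypothetical:

-- Each SELECT produces an ephemeral relation; UNION combines them
-- without permanently adding rows to either relvar.
SELECT name, city FROM current_employees
UNION
SELECT name, city FROM retired_employees WHERE retire_year >= 2020;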

Related

Index not used against MySQL SET column

I have a large data table containing details by date and across 3 independent criteria, with around 12 discrete values for each criterion. That is, each criteria field in the table is defined as a 12-value ENUM. Users pull summary data by date and any filtering across the three criteria, including none at all. To make a single-criterion lookup efficient, 3 separate indexes are required: (date,CriteriaA), (date,CriteriaB), (date,CriteriaC). 4 indexes if you want to look up against any of the 3: (date,A,B,C), (date,A,C), (date,B,C), (date,C).
In an attempt to be more efficient in the lookup, I built a SET column containing all 36 values from the 3 criteria. All values across the criteria are unique and none is a subset of any other. I added an index on this set: (date, set_col). Queries against this table using a set lookup fail to take advantage of the index, however. Neither FIND_IN_SET('Value',set_col), set_col LIKE '%Value%', nor set_col & [pos. in set] triggers the index (according to EXPLAIN and the overall result-set return speed).
Is there a trick to indexing SET columns?
I tried queries like
Select Date, count(*)
FROM tbl
where DATE between [Start] and [End]
and FIND_IN_SET('Value',set_col)
group by Date
I would expect it to run nearly as fast as a lookup against an individual criteria column that has an index on it. But instead it runs only as fast as when just an index on DATE exists. The same number of rows is processed according to EXPLAIN.
It's not possible to index SET columns for arbitrary queries.
A SET type is basically a bitfield, with one bit set for each of the values defined for your set. You could search for a specific bit pattern in such a bitfield, or you could search for a range of specific bit patterns, or an inequality, etc. But searching for rows where one specific bit is set in the bitfield is not going to be indexable.
FIND_IN_SET() is really searching for a specific bit set in the bitfield. It will not use an index for this predicate. The best you can hope to do for optimization is to have an index that narrows down the examined rows based on the other search term on date. Then among the rows matching the date range, the FIND_IN_SET() will be applied row-by-row.
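For instance, a minimal sketch of that pattern against the asker's tbl (the index name and date range are illustrative):

-- The date index narrows the range scan; the SET test is a per-row filter.
ALTER TABLE tbl ADD INDEX idx_date (date);

SELECT date, COUNT(*)
FROM tbl
WHERE date BETWEEN '2024-01-01' AND '2024-01-31'  -- indexed range scan
  AND FIND_IN_SET('Value', set_col)               -- evaluated row-by-row
GROUP BY date;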
It's the same problem as searching for substrings. The following predicates will not use an index on the column:
SELECT ... WHERE SUBSTRING(mytext, 5, 8) = 'word'
SELECT ... WHERE LOCATE('word', mytext) > 0
SELECT ... WHERE mytext LIKE '%word%'
A conventional index on the data would be alphabetized from the start of the string, not from some arbitrary point in the middle of the string. This is why fulltext indexing was created as an alternative to a simple B-tree index on the whole string value. But there's no special index type for bitfields.
I don't think the SET data type is helping in your case.
You should use your multi-column indexes with permutations of the columns.
Go back to 3 ENUMs. Then have
INDEX(A, date),
INDEX(B, date),
INDEX(C, date)
Those should significantly help with queries like
WHERE A = 'foo' AND date BETWEEN...
and somewhat help for
WHERE A = 'foo' AND date BETWEEN...
AND B = 'bar'
If you will also have queries without A/B/C, then add
INDEX(date)
Note: INDEX(date, A) is no better than INDEX(date) when using a "range". That is, I recommend against the indexes you mentioned.
FIND_IN_SET(), like virtually all other function calls, is not sargable. However, enum=const is sargable, since it is implemented as a simple integer.
You did not mention
WHERE A IN ('x', 'y') AND ...
That is virtually un-indexable. However, my suggestions are better than nothing.
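Putting that advice together, a hedged sketch of the suggested schema; the table name and enum values are hypothetical stand-ins for the real ones:

-- Three ENUM columns, each leading its own composite index.
CREATE TABLE tbl (
  date DATE NOT NULL,
  A ENUM('a1','a2','a3') NOT NULL,  -- ~12 values each in practice
  B ENUM('b1','b2','b3') NOT NULL,
  C ENUM('c1','c2','c3') NOT NULL,
  INDEX (A, date),
  INDEX (B, date),
  INDEX (C, date),
  INDEX (date)  -- for queries that filter on date alone
);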

Understanding the use of bitwise operators in MySQL?

Can someone explain the purpose of using bitwise operators (like BIT_OR) in MySQL queries? For example, if I have a table such as the following:
What is the purpose of aggregate operation like:
SELECT name FROM `table` GROUP BY name HAVING BIT_OR(value) = 0;
What exactly does BIT_OR do? I understand the actual operation of converting two integers to binary and determining each pair of corresponding digits (1 if at least one of them is a 1, else 0), but what happens with varchar or other non-numeric columns? I know, for example, that the SUM aggregate function can give me the sum of a column for each group. Likewise, what does BIT_OR tell me for each group?
**NOTE:** I randomly created the above table and query - it doesn't illustrate any specific problem.
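As a hedged illustration (not tied to the asker's table): BIT_OR ORs the values of each group together bit by bit, so HAVING BIT_OR(value) = 0 keeps only groups in which every value is 0. String values are coerced to numbers first, as with other numeric aggregates. A minimal sketch with a hypothetical table t:

-- Hypothetical data: integer flags per name.
CREATE TABLE t (name VARCHAR(10), value INT);
INSERT INTO t VALUES ('a', 0), ('a', 0), ('b', 0), ('b', 4);

-- BIT_OR(value) is 0 for 'a' (0|0) and 4 for 'b' (0|4),
-- so only 'a' survives the HAVING clause.
SELECT name FROM t GROUP BY name HAVING BIT_OR(value) = 0;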

Query for multiple conditions in MySQL

I want to be able to query for multiple conditions when I have a table that connects the ids from two other tables.
My three tables
destination:
id_destination, name_destination
keyword:
id_keyword, name_keyword
destination_keyword:
id_keyword, id_destination
Where the last one connects ids from the destination- and the keyword table, in order to associate destination with keywords.
A query to get the destination based on keyword would then look like
SELECT destination.name_destination FROM destination
NATURAL JOIN destination_keyword
NATURAL JOIN keyword
WHERE keyword.name_keyword like _keyword_
Is it possible to query for multiple keywords? Let's say I wanted to get the destinations that match all or some of the keywords in the list sunny, ocean, fishing, ordered by number of matches. How would I move forward? Should I restructure my tables? I am sort of new to SQL and would very much like some input.
Order your table joins starting with keyword and use a count on the number of time the destination is joined:
select
d.id_destination,
d.name_destination,
count(d.id_destination) as matches
from keyword k
join destination_keyword dk on dk.id_keyword = k.id_keyword
join destination d on d.id_destination = dk.id_destination
where name_keyword in ('sunny', 'ocean', 'fishing')
group by 1, 2
order by 3 desc
This query assumes that name_keyword values are single words like "sunny".
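If instead you need only the destinations that match all of the keywords, one hedged variation (assuming the same three-keyword list) is to add a HAVING clause:

-- Keep only destinations matched by all 3 requested keywords.
select d.id_destination, d.name_destination, count(d.id_destination) as matches
from keyword k
join destination_keyword dk on dk.id_keyword = k.id_keyword
join destination d on d.id_destination = dk.id_destination
where k.name_keyword in ('sunny', 'ocean', 'fishing')
group by 1, 2
having count(distinct k.name_keyword) = 3
order by 3 desc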
Using natural joins is not a good idea: if the table structures change such that two naturally joined tables are altered to have columns with the same name added, your query will suddenly stop working. Also, by explicitly declaring the join condition, readers of your code will immediately understand how the tables are joined, and can modify it to add non-key conditions as required.
Requiring that only key columns share the same name is also restrictive, because it forces unnatural column names like "name_keyword" instead of simply "name" - the suffix "_keyword" is redundant, adds no value, and exists only because you are using natural joins.
Natural joins save hardly any typing (and often cause more typing over all) and impose limitations on join types and names and are brittle.
They are to be avoided.
You can try something like the following:
SELECT dest.name_destination, count(*) FROM destination dest, destination_keyword dest_key, keyword kw
WHERE kw.id_keyword = dest_key.id_keyword
AND dest_key.id_destination = dest.id_destination
AND kw.name_keyword IN ('sunny', 'ocean', 'fishing')
GROUP BY dest.name_destination
ORDER BY count(*), dest.name_destination
Haven't tested it, but if it is not correct it should show you the way to accomplish it.
You can do multiple LIKE statements:
Column LIKE 'value1' OR Column LIKE 'value2' OR ...
Or you could do a regular expression match (with RLIKE/REGEXP rather than LIKE):
Column RLIKE 'something|other|whatever'
The trick to ordering by number of matches has to do with understanding the GROUP BY and ORDER BY clauses. You either want one count for everything, or one count per something. In the first case you just use the COUNT function by itself; in the second case you use the GROUP BY clause to group the somethings/categories you want counted. ORDER BY should be pretty straightforward.
I think based on the information you have provided your table structure is fine.
Hope this helps.
DISCLAIMER: My syntax isn't accurate.

How to grab rows which contain a dual id reference

I have a messages table:
messages:
id(int)
send_id(int)
receive_id(int)
And I want to be able to select rows from this only when a->b and b->a exist, e.g.:
id send_id receive_id
0, 15, 16
1, 16, 15
So that basically one message has been passed to each person. How would I go about selecting just one of those two rows (either send or receive), and all of those for a specific id?
I want to only return results that have this duality.
My code currently uses a nested SELECT and doesn't work at all as needed.
You can achieve the result by taking advantage of MySQL's LEAST and GREATEST built-in functions.
SELECT *
FROM messages
WHERE (LEAST(send_id, receive_id), GREATEST(send_id, receive_id), id)
IN
(
SELECT LEAST(send_id, receive_id) as x,
GREATEST(send_id, receive_id) as y,
MAX(id) msg_ID
FROM messages
GROUP BY x, y
HAVING COUNT(DISTINCT send_id) = 2 -- both directions must exist
);
You have to define an additional synthesized column for this. There are different alternatives: permanent as an indexed column (fast), temporary if just for a selection once a month, or computed on the fly inside the actual query.
Whatever the alternative, that column should contain both ids, ordered numerically and concatenated, perhaps with a separator character like -. Now when you apply a uniqueness restriction to that column, only one of the two candidate rows can enter the result; the second one is rejected because it would violate that uniqueness rule.
The trick is the ordered concatenation, instead of a normal combined index that would allow both variants due to the different order of ids.
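A hedged sketch of the permanent variant using a MySQL generated column (5.7+); the column and index names are illustrative:

-- Ordered concatenation of the two ids as a stored generated column;
-- grouping or a uniqueness rule on pair_key then admits only one of
-- the two directed rows per pair.
ALTER TABLE messages
  ADD COLUMN pair_key VARCHAR(30) AS (
    CONCAT(LEAST(send_id, receive_id), '-', GREATEST(send_id, receive_id))
  ) STORED,
  ADD INDEX idx_pair_key (pair_key);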

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a field that stores from 3 up to 16 different "combinations" (varchar). A combination is a 4-digit number, a colon, and a char (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (there can be up to 16). This field is in every row of the table, never null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to be done here is: when you send '0001:A/0002:A/0003:C' via a program (or as a parameter in SQL), check every row and see whether it belongs to the same "group". Now the catch is that it has to go both ways, it has to be done quickly, and the SQL needs to compare the fields somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all rows that share 3-16 of the same combinations and return them. The 3-16 can be specified via a parameter, but the problem is that you would need to find all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see 0002:A can be the first parameter.
But you cannot use indexing, because a combination can be at any position in the string, and you can send combinations that are not "attached" (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all rows that share the same 3-16 combinations,
and it has to go both ways: if you send "0001:A/0002:A/0003:C" it has to find the row above + similar rows (all that contain all of the parameters).
Some things/options I tried:
Doing a LIKE for every sent combination is not practical + too slow
Giving the field a full-text index isn't an option (I don't know why exactly)
One of the few things that could work would be making some "hash" type of encoding for the fields, calculating it via a program, and searching for all identical "hashes" (I don't know how you would do that, given that a hash generates different values for similar texts; maybe some hash written specifically for this)
Making a new field, calculating/writing (can be done on insert) all possible combinations and checking via SQL/program whether they share the same % of combinations; but I don't know how you can store 10080 combinations (in the case of 16) in a "varchar" effectively, or via some hash code + then know which of them are similar.
There is another catch: this table is in use almost 24/7, and computing the combinations to check whether they are the same in SQL is too slow because the table is too big. It could be done via a program, but I don't have any clue how you could store this in a new column in a way that tells you the rows are the same. One possibility would be to calculate the combinations on each row insert, store them via some hash code, calculate the "hash" via a program, and check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to execute really fast, under 1 minute, because there could be new inserts into the table that you would need to check.
The whole point of this is to see whether any similar combinations already exist in SQL, and to block any new "similar" combination from being inserted.
I have been dealing with this problem for 3 days now without finding a possible solution; the closest thing was a different type of insert/hash, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO, if the group is a basic element of your data structure, your database structure is flawed: to be normalized, each group should be in its own cell. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID,COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 --lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
EDIT
Rethinking it a bit more, you could even break each group up into its two parts, but as I understand it you always have full groups to deal with, so that doesn't seem necessary.
EDIT2: If you want to speed the process up even more, I'd recommend translating the groups into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
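A minimal sketch of that encoding directly in MySQL, assuming the group text always has the exact 'NNNN:X' form (the expression is illustrative):

-- 'NNNN:X' -> NNNN * 16 + hex value of X (A=10 ... E=14).
SELECT CAST(SUBSTRING('0343:D', 1, 4) AS UNSIGNED) * 16
     + (ASCII(SUBSTRING('0343:D', 6, 1)) - ASCII('A') + 10) AS encoded;
-- Returns 5501, matching the worked example above.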
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented, you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key, exploding your column out on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
t1.`key`
, t2.`key`
, COUNT(*)
FROM
temp_table t1
JOIN temp_table t2
ON t2.combination = t1.combination
AND t2.`key` <> t1.`key`
JOIN ( SELECT t3.`key`, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.`key` ) t4
ON t4.`key` = t1.`key`
GROUP BY t1.`key`, t2.`key`, t4.cnt
HAVING
COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
I guess I can recommend building a special "index".
It will be quite big, but you will achieve super-speedy results.
Let's consider this task as searching for a set of symbols.
The design conditions are these:
The symbols are made by the pattern "NNNN:X", where NNNN is a number [0001-9999] and X is a letter [A-E].
So we have 5 * 9999 = 49995 symbols in the alphabet.
The maximum length of a word in this alphabet is 16.
We can build, for each word, the set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As symbols are sorted in words we have only 2^N-1 combinations (15 for 4 symbols).
For 16-symbols word there are 2^16 - 1 = 65535 combinations.
So we make an additional index-organized table like this one:
create table spec_ndx(combination varchar(100), original_value varchar(100))
Performance will be excellent, at the price of overhead: in the worst case, for each record in the original table there will be 65535 "index" records.
So for a 100-million-row table we will get a 6-trillion-row "index" table.
But if we have short values, the size of the "special index" shrinks drastically.
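A hedged sketch of how the lookup would then work, assuming the application pre-populates spec_ndx with every sorted subset of each row's symbols:

-- Index the combination column; a similarity lookup is then a single
-- indexed equality probe per candidate subset.
CREATE INDEX ix_spec_ndx ON spec_ndx (combination);

SELECT original_value
FROM spec_ndx
WHERE combination = '0001:A/0002:A/0003:C';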