Inconsistency with MySQL - USING vs ON [duplicate] - mysql

In a MySQL JOIN, what is the difference between ON and USING()? As far as I can tell, USING() is just more convenient syntax, whereas ON allows a little more flexibility when the column names are not identical. However, that difference is so minor, you'd think they'd just do away with USING().
Is there more to this than meets the eye? If yes, which should I use in a given situation?

It is mostly syntactic sugar, but a couple of differences are noteworthy:
ON is the more general of the two. One can join tables ON a column, a set of columns and even a condition. For example:
SELECT * FROM world.City JOIN world.Country ON (City.CountryCode = Country.Code) WHERE ...
USING is useful when both tables share a column of the exact same name on which they join. In this case, one may say:
SELECT ... FROM film JOIN film_actor USING (film_id) WHERE ...
An additional nice treat is that one does not need to fully qualify the joining columns:
SELECT film.title, film_id -- film_id is not prefixed
FROM film
JOIN film_actor USING (film_id)
WHERE ...
To illustrate, to do the above with ON, we would have to write:
SELECT film.title, film.film_id -- film.film_id is required here
FROM film
JOIN film_actor ON (film.film_id = film_actor.film_id)
WHERE ...
Notice the film.film_id qualification in the SELECT clause. It would be invalid to just say film_id since that would make for an ambiguity:
ERROR 1052 (23000): Column 'film_id' in field list is ambiguous
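The ambiguity can be reproduced outside MySQL as well. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, so the error text differs from MySQL's ERROR 1052; the two small film and film_actor tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO film VALUES (1, 'Film A');
    INSERT INTO film_actor VALUES (10, 1), (20, 1);
""")

# With USING, the unqualified film_id is accepted.
using_rows = conn.execute(
    "SELECT film.title, film_id FROM film JOIN film_actor USING (film_id)"
).fetchall()

# With ON, the same unqualified film_id is rejected as ambiguous.
try:
    conn.execute(
        "SELECT film.title, film_id FROM film "
        "JOIN film_actor ON film.film_id = film_actor.film_id"
    )
    on_error = None
except sqlite3.OperationalError as exc:
    on_error = str(exc)

print(using_rows)
print(on_error)
```

Qualifying the column as film.film_id in the ON version makes the error go away, as in the MySQL example above.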
As for select *, the joining column appears in the result set twice with ON while it appears only once with USING:
mysql> create table t(i int);insert t select 1;create table t2 select*from t;
Query OK, 0 rows affected (0.11 sec)
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
Query OK, 1 row affected (0.19 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> select*from t join t2 on t.i=t2.i;
+------+------+
| i | i |
+------+------+
| 1 | 1 |
+------+------+
1 row in set (0.00 sec)
mysql> select*from t join t2 using(i);
+------+
| i |
+------+
| 1 |
+------+
1 row in set (0.00 sec)
mysql>

Thought I would chip in here with when I have found ON to be more useful than USING. It is when OUTER joins are introduced into queries.
ON lets you restrict the rows of the table that a query is OUTER joining onto while still maintaining the OUTER join. Attempting to restrict those rows through a WHERE clause instead will, effectively, change the OUTER join into an INNER join.
Granted this may be a relative corner case. Worth putting out there though.....
For example:
CREATE TABLE country (
countryId int(10) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
country varchar(50) not null,
UNIQUE KEY countryUIdx1 (country)
) ENGINE=InnoDB;
insert into country(country) values ("France");
insert into country(country) values ("China");
insert into country(country) values ("USA");
insert into country(country) values ("Italy");
insert into country(country) values ("UK");
insert into country(country) values ("Monaco");
CREATE TABLE city (
cityId int(10) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
countryId int(10) unsigned not null,
city varchar(50) not null,
hasAirport boolean not null default true,
UNIQUE KEY cityUIdx1 (countryId,city),
CONSTRAINT city_country_fk1 FOREIGN KEY (countryId) REFERENCES country (countryId)
) ENGINE=InnoDB;
insert into city (countryId,city,hasAirport) values (1,"Paris",true);
insert into city (countryId,city,hasAirport) values (2,"Beijing",true);
insert into city (countryId,city,hasAirport) values (3,"New York",true);
insert into city (countryId,city,hasAirport) values (4,"Napoli",true);
insert into city (countryId,city,hasAirport) values (5,"Manchester",true);
insert into city (countryId,city,hasAirport) values (5,"Birmingham",false);
insert into city (countryId,city,hasAirport) values (3,"Cincinnati",false);
insert into city (countryId,city,hasAirport) values (6,"Monaco",false);
-- Gah. Left outer join is now effectively an inner join
-- because of the where predicate
select *
from country left join city using (countryId)
where hasAirport
;
-- Hooray! I can see Monaco again thanks to
-- moving my predicate into the ON
select *
from country co left join city ci on (co.countryId=ci.countryId and ci.hasAirport)
;
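For anyone who wants to try this without a MySQL server, here is a minimal sketch of the same effect using Python's built-in sqlite3 module. It is SQLite rather than MySQL, the schema is cut down to two countries, and hasAirport is stored as 0/1; those simplifications are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (countryId INTEGER PRIMARY KEY, country TEXT NOT NULL);
    CREATE TABLE city (
        cityId INTEGER PRIMARY KEY,
        countryId INTEGER NOT NULL REFERENCES country (countryId),
        city TEXT NOT NULL,
        hasAirport INTEGER NOT NULL DEFAULT 1
    );
    INSERT INTO country VALUES (1, 'France'), (6, 'Monaco');
    INSERT INTO city (countryId, city, hasAirport) VALUES
        (1, 'Paris', 1), (6, 'Monaco', 0);
""")

# Predicate in WHERE: applied after the join, so Monaco's row is dropped.
where_rows = conn.execute("""
    SELECT co.country, ci.city
    FROM country co LEFT JOIN city ci USING (countryId)
    WHERE ci.hasAirport
    ORDER BY co.country
""").fetchall()

# Predicate in ON: Monaco stays in the result, with NULL for the city.
on_rows = conn.execute("""
    SELECT co.country, ci.city
    FROM country co LEFT JOIN city ci
      ON co.countryId = ci.countryId AND ci.hasAirport
    ORDER BY co.country
""").fetchall()

print(where_rows)
print(on_rows)
```

The WHERE version returns only France/Paris, while the ON version also returns Monaco with no matching city.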

Wikipedia has the following information about USING:
The USING construct is more than mere syntactic sugar, however, since
the result set differs from the result set of the version with the
explicit predicate. Specifically, any columns mentioned in the USING
list will appear only once, with an unqualified name, rather than once
for each table in the join. In the case above, there will be a single
DepartmentID column and no employee.DepartmentID or
department.DepartmentID.
The Postgres documentation also defines them pretty well:
The ON clause is the most general kind of join condition: it takes a
Boolean value expression of the same kind as is used in a WHERE
clause. A pair of rows from T1 and T2 match if the ON expression
evaluates to true.
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b =
T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.

Database tables
To demonstrate how the USING and ON clauses work, let's assume we have the following post and post_comment database tables, which form a one-to-many table relationship via the post_id Foreign Key column in the post_comment table referencing the post_id Primary Key column in the post table:
The parent post table has 3 rows:
| post_id | title |
|---------|-----------|
| 1 | Java |
| 2 | Hibernate |
| 3 | JPA |
and the post_comment child table has 3 records:
| post_comment_id | review | post_id |
|-----------------|-----------|---------|
| 1 | Good | 1 |
| 2 | Excellent | 1 |
| 3 | Awesome | 2 |
The JOIN ON clause using a custom projection
Traditionally, when writing an INNER JOIN or LEFT JOIN query, we use the ON clause to define the join condition.
For example, to get the comments along with their associated post title and identifier, we can use the following SQL projection query:
SELECT
post.post_id,
title,
review
FROM post
INNER JOIN post_comment ON post.post_id = post_comment.post_id
ORDER BY post.post_id, post_comment_id
And, we get back the following result set:
| post_id | title | review |
|---------|-----------|-----------|
| 1 | Java | Good |
| 1 | Java | Excellent |
| 2 | Hibernate | Awesome |
The JOIN USING clause using a custom projection
When the Foreign Key column and the column it references have the same name, we can use the USING clause, like in the following example:
SELECT
post_id,
title,
review
FROM post
INNER JOIN post_comment USING(post_id)
ORDER BY post_id, post_comment_id
And, the result set for this particular query is identical to the previous SQL query that used the ON clause:
| post_id | title | review |
|---------|-----------|-----------|
| 1 | Java | Good |
| 1 | Java | Excellent |
| 2 | Hibernate | Awesome |
The USING clause works for Oracle, PostgreSQL, MySQL, and MariaDB. SQL Server doesn't support the USING clause, so you need to use the ON clause instead.
The USING clause can be used with INNER, LEFT, RIGHT, and FULL JOIN statements.
SQL JOIN ON clause with SELECT *
Now, if we change the previous ON clause query to select all columns using SELECT *:
SELECT *
FROM post
INNER JOIN post_comment ON post.post_id = post_comment.post_id
ORDER BY post.post_id, post_comment_id
We are going to get the following result set:
| post_id | title | post_comment_id | review | post_id |
|---------|-----------|-----------------|-----------|---------|
| 1 | Java | 1 | Good | 1 |
| 1 | Java | 2 | Excellent | 1 |
| 2 | Hibernate | 3 | Awesome | 2 |
As you can see, the post_id is duplicated because both the post and post_comment tables contain a post_id column.
SQL JOIN USING clause with SELECT *
On the other hand, if we run a SELECT * query that features the USING clause for the JOIN condition:
SELECT *
FROM post
INNER JOIN post_comment USING(post_id)
ORDER BY post_id, post_comment_id
We will get the following result set:
| post_id | title | post_comment_id | review |
|---------|-----------|-----------------|-----------|
| 1 | Java | 1 | Good |
| 1 | Java | 2 | Excellent |
| 2 | Hibernate | 3 | Awesome |
You can see that this time, the post_id column is deduplicated, so there is a single post_id column being included in the result set.
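The difference in the SELECT * column list can be checked programmatically. The sketch below uses Python's built-in sqlite3 module (SQLite, which is not among the databases listed above, but it handles USING the same way here) with the post and post_comment rows from this article; the column_names helper is a hypothetical name of my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (post_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_comment (
        post_comment_id INTEGER PRIMARY KEY,
        review TEXT,
        post_id INTEGER REFERENCES post (post_id)
    );
    INSERT INTO post VALUES (1, 'Java'), (2, 'Hibernate'), (3, 'JPA');
    INSERT INTO post_comment VALUES
        (1, 'Good', 1), (2, 'Excellent', 1), (3, 'Awesome', 2);
""")


def column_names(sql):
    """Return the result-set column names of a query."""
    return [col[0] for col in conn.execute(sql).description]


on_columns = column_names(
    "SELECT * FROM post "
    "INNER JOIN post_comment ON post.post_id = post_comment.post_id"
)
using_columns = column_names(
    "SELECT * FROM post INNER JOIN post_comment USING (post_id)"
)

print(on_columns)     # post_id appears twice
print(using_columns)  # post_id appears once
```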
Conclusion
If the database schema is designed so that Foreign Key column names match the columns they reference, and the JOIN conditions only check if the Foreign Key column value is equal to the value of its mirroring column in the other table, then you can employ the USING clause.
Otherwise, if the Foreign Key column name differs from the column it references or you want to include a more complex join condition, then you should use the ON clause instead.

For those experimenting with this in phpMyAdmin, just a word:
phpMyAdmin appears to have a few problems with USING. For the record this is phpMyAdmin run on Linux Mint, version: "4.5.4.1deb2ubuntu2", Database server: "10.2.14-MariaDB-10.2.14+maria~xenial - mariadb.org binary distribution".
I have run SELECT commands using JOIN and USING in both phpMyAdmin and in Terminal (command line), and the ones in phpMyAdmin produce some baffling responses:
1) a LIMIT clause at the end appears to be ignored.
2) the supposed number of rows as reported at the top of the page with the results is sometimes wrong: for example 4 are returned, but at the top it says "Showing rows 0 - 24 (2503 total, Query took 0.0018 seconds.)"
Logging on to mysql normally and running the same queries does not produce these errors. Nor do these errors occur when running the same query in phpMyAdmin using JOIN ... ON .... Presumably a phpMyAdmin bug.

Short answer:
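USING: when the join column has the same name in both tables, so an unqualified reference would otherwise be ambiguous
ON: when the column names differ or the join needs a different kind of comparison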

Related

Remove Purge duplicate/multiplicate records from mariadb

Briefly: the database is imported from a foreign source, so I cannot prevent duplicates; I can only prune and clean the database.
The foreign db changes daily, so I want to automate the pruning process.
It resides on:
MariaDB v10.4.6, managed predominantly through the phpMyAdmin GUI v4.9.0.1 (both pretty much up to date as of this writing).
This is a radio browsing database.
It has multiple columns, but only a few are important to me:
StationID (a unique entry number; because it is the primary key, the db does not consider new entries duplicates, as all of them are unique)
There are no row numbers.
Name, url, home-page, country, etc.
I want to remove entries with duplicated urls, based on the following:
a duplicated url has a country attached, but some country values are NULL (=empty)
so I want to remove all duplicates except the one containing a country name, if there is such an entry; if there is none, keep just one url, regardless of name (names are multilingual, so some duplicated urls also have various names, which I do not care about).
StationID (unique number, but not consecutive; also the primary db key)
Name (variable, least important)
url (variable, but I do want to remove the duplicates)
country (variable, frequently NULL/empty; I want to eliminate the entries with empty values as much as possible)
One url has to stay by any means (not to be deleted)
I have tried a multitude of queries; some work for SELECT but do NOT work for DELETE, and some hang my machine when executed. Here are some queries I tried (remember I use MariaDB, not Oracle or MS SQL):
SELECT * from `radio`.`Station`
WHERE (`radio`.`Station`.`Url`, `radio`.`Station`.`Name`) IN (
SELECT `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
HAVING COUNT(*) > 1)
This one should show all entries (not only one per group), but this query hangs my machine.
This query gets me as close as possible:
SELECT *
FROM `radio`.`Station`
WHERE `radio`.`Station`.`StationID` NOT IN (
SELECT MAX(`radio`.`Station`.`StationID`)
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`,`radio`.`Station`.`Name`,`radio`.`Station`.`Country`)
However this query lists more entries:
SELECT *, COUNT(`radio`.`Station`.`Url`) FROM `radio`.`Station` GROUP BY `radio`.`Station`.`Name`,`radio`.`Station`.`Url` HAVING (COUNT(`radio`.`Station`.`Url`) > 1);
But all of these queries group them and display only one row.
I also tried UNION and INNER JOIN, but failed.
I also tried WITH cte AS..., but phpMyAdmin does NOT like this query, and the mariadb cli did not like it either.
I also tried something of this kind, published on an Oracle blog, which did not work, and I really had no clue what was what in this query:
select *
from (
select f.*,
count(*) over (
partition by `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
) ct
from `radio`.`Station` f
)
where ct > 1
I did not know what f.* was, query did not like ct.
Given
drop table if exists radio;
create table radio
(stationid int,name varchar(3),country varchar(3),url varchar(3));
insert into radio values
(1,'aaa','uk','a/b'),
(2,'bbb','can','a/b'),
(3,'bbb',null,'a/b'),
(4,'bbb',null,'b/b'),
(5,'bbb',null,'b/b');
You could give the null countries a unique value (using coalesce), fortunately stationid is unique so:
select t.stationid,t.name,t.country,t.url
from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry= coalesce(t.country,t.stationid);
Yields
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
Translated to a delete
delete t from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry <> coalesce(t.country,t.stationid);
MariaDB [sandbox]> select * from radio;
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
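If you want to experiment with the idea before touching the real table, here is a minimal sketch using Python's built-in sqlite3 module. It is not MariaDB: SQLite has no multi-table DELETE ... JOIN, so the same keeper rows are selected in a subquery and everything else is deleted, and stationid is cast to text explicitly so the comparison is plainly a string comparison. Both changes are my own adaptation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE radio (stationid INTEGER, name TEXT, country TEXT, url TEXT);
    INSERT INTO radio VALUES
        (1, 'aaa', 'uk',  'a/b'),
        (2, 'bbb', 'can', 'a/b'),
        (3, 'bbb', NULL,  'a/b'),
        (4, 'bbb', NULL,  'b/b'),
        (5, 'bbb', NULL,  'b/b');
""")

# Keep, per url, the row whose COALESCE(country, stationid) is the maximum;
# delete every row that is not such a keeper.
conn.execute("""
    DELETE FROM radio
    WHERE stationid NOT IN (
        SELECT t.stationid
        FROM radio t
        JOIN (SELECT url,
                     MAX(COALESCE(country, CAST(stationid AS TEXT))) AS cntry
              FROM radio
              GROUP BY url) s
          ON s.url = t.url
         AND s.cntry = COALESCE(t.country, CAST(t.stationid AS TEXT))
    )
""")

remaining = conn.execute(
    "SELECT stationid, name, country, url FROM radio ORDER BY stationid"
).fetchall()
print(remaining)
```

Note that two rows with the same url and the same country would both survive this, so it is worth re-checking for leftovers afterwards.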
Fix 2 problems at once:
Dup rows already in table
Dup rows can still be put in table
Do this for each table:
CREATE TABLE new LIKE real;
ALTER TABLE new ADD UNIQUE(x,y); -- will prevent future dups
INSERT IGNORE INTO new -- IGNORE dups
SELECT * FROM real;
RENAME TABLE real TO old, new TO real;
DROP TABLE old;

2 inner joins between same 2 tables

I am trying to select columns from 2 tables.
The INNER JOIN conditions are $table1.idaction_url=$table2.idaction AND $table1.idaction_name=$table2.idaction.
However, the query below produces no output. It seems like the INNER JOIN can only take 1 condition: if I use AND to include both conditions, as shown in the query below, there won't be any output. Please advise.
$mysql=("SELECT conv(hex($table1.idvisitor), 16, 10) as visitorId,
$table1.server_time, $table1.idaction_url,
$table1.time_spent_ref_action,$table2.name,
$table2.type, $table1.idaction_name, $table2.idaction
FROM $table1
INNER JOIN $table2
ON $table1.idaction_url=$table2.idaction
AND $table1.idaction_name=$table2.idaction
WHERE conv(hex(idvisitor), 16, 10)='".$id."'
ORDER BY server_time DESC");
Short answer:
You need to use two separate inner joins, not only a single join.
E.g.
SELECT `actionurls`.`name` AS `actionUrl`, `actionnames`.`name` AS `actionName`
FROM `table1`
INNER JOIN `table2` AS `actionurls` ON `table1`.`idaction_url` = `actionurls`.`idaction`
INNER JOIN `table2` AS `actionnames` ON `table1`.`idaction_name` = `actionnames`.`idaction`
(Modify this query with any additional fields you want to select).
In depth: INNER JOIN, when done on a value unique to the second table (the table joined to the first in this operation), will only ever fetch one row per row of the first table. Judging by the select part of your statement, what you want to do is fetch data from the other table twice, into the same row.
INNER JOIN table2 ON [comparison] will, for each row selected from table1, grab any rows from table2 for which [comparison] is TRUE, then copy the row from table1 N times, where N is the number of rows found in table2. If N = 0, then the row is skipped. In our case N = 1, so an INNER JOIN of idaction_name in table1 to idaction in table2, for example, will allow you to select all the action names.
In order to get the action urls as well we have to INNER JOIN a second time. Now you can't join the same table twice normally, as SQL won't know which of the two joined tables is meant when you type table2.name in the first part of your query. This would be ambiguous if both had the same name. There's a solution for this, table aliases.
The output (of my answer above) is going to be something like:
+-----+------------------------+-------------------------+
| Row | actionUrl | actionName |
+-----+------------------------+-------------------------+
| 1 | unx.co.jp/ | UNIX | Kumamoto Home |
| 2 | unx.co.jp/profile.html | UNIX | Kumamoto Profile |
| ... | ... | ... |
+-----+------------------------+-------------------------+
While if you used only a single join, you would get this kind of output (using OR):
+-----+-------------------------+
| Row | actionUrl |
+-----+-------------------------+
| 1 | unx.co.jp/ |
| 2 | UNIX | Kumamoto Home |
| 3 | unx.co.jp/profile.html |
| 4 | UNIX | Kumamoto Profile |
| ... | ... |
+-----+-------------------------+
Using AND and a single join, you only get output if idaction_name == idaction_url is TRUE. This is not the case, so there's no output.
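To see both behaviours side by side, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The table and column names follow the question, but the single row of data and the example.test URL are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (idaction INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table1 (idaction_url INTEGER, idaction_name INTEGER);
    INSERT INTO table2 VALUES (1, 'example.test/'), (2, 'Example Home');
    INSERT INTO table1 VALUES (1, 2);
""")

# One join with AND: a single table2 row would have to match both ids.
single_join = conn.execute("""
    SELECT table2.name
    FROM table1
    INNER JOIN table2
       ON table1.idaction_url = table2.idaction
      AND table1.idaction_name = table2.idaction
""").fetchall()

# Two joins to the same table, told apart by their aliases.
double_join = conn.execute("""
    SELECT actionurls.name AS actionUrl, actionnames.name AS actionName
    FROM table1
    INNER JOIN table2 AS actionurls
       ON table1.idaction_url = actionurls.idaction
    INNER JOIN table2 AS actionnames
       ON table1.idaction_name = actionnames.idaction
""").fetchall()

print(single_join)
print(double_join)
```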
If you want to know more about how to use JOINS, consult the manual about them.
Sidenote
Also, I can't help but notice you're using variables (e.g. $table1) that store the names of the tables. Do you make sure that those values do not contain user input? And, if they do, do you at least whitelist a list of tables that users can access? You may have some security issues with this.
INNER JOIN does not put any restriction on the number of conditions it can have.
Zero resulting rows means that there are no rows satisfying the two conditions simultaneously.
Make sure you are joining on the correct columns. Try going step by step to identify where the data is lost.

Sql query how to make it faster?

I have an SQL query. Is it possible to change somehow this query, so it has better performance, but with the same result? This query is working, but it is very slow, and I don't have an idea on improving its performance.
SELECT keyword, query
FROM url_alias ua
JOIN product p
on (p.manufacturer_id =
CONVERT(SUBSTRING_INDEX(ua.query,'=',-1),UNSIGNED INTEGER))
JOIN oc_product_to_storebk ps
on (p.product_id = ps.product_id)
AND ua.query LIKE 'manufacturer_id=%'
AND ps.store_id= '9'
GROUP BY ua.keyword
Table structure:
URL_ALIAS
+-----------------------------------------------+
| url_alias_id | query | keyword |
+--------------+---------------------+----------+
| 1 | manufacturer_id=100 | test |
+--------------+---------------------+----------+
PRODUCT
+-----------------+------------+
| manufacturer_id | product_id |
+-----------------+------------+
| 100 | 1000 |
+-----------------+------------+
OC_PRODUCT_TO_STOREBK
+------------+----------+
| product_id | store_id |
+------------+----------+
| 1000 | 9 |
+------------+----------+
I want all the keywords from the url_alias keyword column, when the following condition is met: LIKE 'manufacturer_id=%' AND ps.store_id='9'
You should avoid the convert function in the join condition: it is expensive, and it leaves no way for the query to profit from indexes on the url_alias table.
Extend your url_alias table so it has additional fields for the parts of the query. You will probably hesitate to go this way, but you will not regret it once you have done it. So your url_alias table should look like this:
create table url_alias (
url_alias_id int,
query varchar(200),
keyword varchar(100),
query_key varchar(200),
query_value_str varchar(200),
query_value_int int
);
If you don't want to recreate it, then add the fields as follows:
alter table url_alias add (
query_key varchar(200),
query_value_str varchar(200),
query_value_int int
);
Update these new columns for the existing records with this statement (only to execute once):
update url_alias
set query_key = substring_index(query, '=', 1),
query_value_str = substring_index(query, '=', -1),
query_value_int = nullif(
convert(substring_index(query,'=',-1),unsigned integer), 0);
Then create a trigger so that these 3 extra fields are updated automatically when you insert a new record:
create trigger ins_sum before insert on url_alias
for each row
set new.query_key = substring_index(new.query, '=', 1),
new.query_value_str = substring_index(new.query, '=', -1),
new.query_value_int = nullif(
convert(substring_index(new.query,'=',-1),unsigned integer), 0);
Note the additional nullif() which will make sure the last field is null when the value after the equal sign is not numerical.
If ever you also update such records, then also create a similar update trigger.
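To make the three derived values concrete, here is a rough Python sketch of what the expressions compute for one query string. It is only an approximation of the MySQL behaviour: split_query is a hypothetical helper name, and it treats a value as numeric only when it consists entirely of ASCII digits, which is a simplification of how convert() parses strings.

```python
def split_query(query):
    """Split a 'key=value' string roughly the way the trigger does."""
    # Like substring_index(query, '=', 1): text before the first '='.
    key = query.split("=", 1)[0]
    # Like substring_index(query, '=', -1): text after the last '='.
    value_str = query.rsplit("=", 1)[-1]
    # Simplifying assumption: only all-digit values count as numeric.
    is_number = value_str.isascii() and value_str.isdigit()
    value_int = int(value_str) if is_number else None
    # Like nullif(..., 0): a zero is stored as NULL too.
    if value_int == 0:
        value_int = None
    return key, value_str, value_int


print(split_query("manufacturer_id=100"))
print(split_query("route=common/home"))
```

The second call uses a made-up non-numeric query string to show the case where the integer column ends up NULL.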
With this set-up, you can still insert records like before:
insert into url_alias (url_alias_id, query, keyword)
values (1, 'manufacturer_id=100', 'test');
When you then select this record, you will see this:
+--------------+---------------------+---------+-----------------+-----------------+-----------------+
| url_alias_id | query | keyword | query_key | query_value_str | query_value_int |
+--------------+---------------------+---------+-----------------+-----------------+-----------------+
| 1 | manufacturer_id=100 | test | manufacturer_id | 100 | 100 |
+--------------+---------------------+---------+-----------------+-----------------+-----------------+
Now the work of extraction and conversion has been done once, and does not have to be repeated any more when you select records. You can rewrite your original query like this:
select ua.keyword, ua.query
from url_alias ua
join product p
on p.manufacturer_id = ua.query_value_int
join oc_product_to_storebk ps
on p.product_id = ps.product_id
and ua.query_key = 'manufacturer_id'
and ps.store_id = 9
group by ua.keyword, ua.query
And now you can improve the performance by creating indexes on both elements of the query:
create index query_key on url_alias(query_key, query_value_int, keyword);
You might need to experiment a bit to get the order of fields right in the index before it gets used by the SQL plan.
See this SQL fiddle.
I assume that you have indexes on the store_id, product_id and keyword columns?
Focus on changing your data model to avoid the CONVERT and LIKE operators. Both of them prevent the query from using indexes on the relevant columns.
Also, take a good look at the data that is stored in the ua.query column. Possibly you need to distribute the data in that column across multiple columns so that you can use indexes.

not sure why join query is returning resultset longer than i'd expect, and taking long to execute

I have reached an impasse with my knowledge regarding mysql joins, and the query I'm trying to execute is taking way too long... Although I'm only a short while into learning mysql on my own, I have put time into reading about the mechanics of indexes and joins, done many google searches and tried a few different query formats. To no avail, I need help please.
Firstly, I will say that my database is, at the moment, to be optimized for speed of select queries. I know I have a few too many indexes... my theory of learning mysql is to make a few too many indexes and examine what the mysql optimizer chooses for my purposes (determined by using explain) and then determine why it has chosen said index.
Anyhow, I have four tables: table1, table2, table3, table4...
table1.ID1 is the primary key, and other data in table1 might be divided into multiple content in table2.
table2.ID1 identifies every entry in table2 that is built upon content from table1
table2.ID2 is the primary key for table2
table3.ID2 identifies every entry in table3 that is built upon content from table2
table3.ID3 is the primary key for table3
table4.ID3 identifies every entry in table4 that is built upon content from table3
Not every entry in table1 has corresponding data in table2, and similarly table2 to table3, and table3 to table4.
What I need to do is retrieve the distinct values of ID2 that appear within a date range, and also only if the table2 content eventually appears in table4. The challenge I'm facing is that only table1 has a date column, and I need only entries that also appear in table4.
The following query takes approx 2 minutes.
select table2.ID2 from table1
left join table2 on
table1.ID1 = table2.ID1
left join table3 on
table3.ID2 = table2.ID2
left join table4 on
table4.ID3 = table3.ID3
where table1.Date between "2012-03-11" and "2012-03-18"
By using explain with the above query, I see no reason why it should take so long.
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
| 1 | SIMPLE | table1 | range | ... | Datekey | 9 | NULL | 17528 | Using where; Using index |
| 1 | SIMPLE | table2 | ref | ... | ID1key | 8 | mydata.table1.POSTID | 1 | |
| 1 | SIMPLE | table3 | ref | ... | ID2key | 8 | mydata.table2.SrcID | 20 | |
| 1 | SIMPLE | table4 | ref | ... | ID3key | 8 | mydata.table3.ParsedID | 10 | Using index |
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
I've replaced the names of possible keys with '...' as it's not that important. In any case, a key is selected.
Moreover, the number of rows in the resultset in the query is much more than the purported matching 17528 rows in the explain resultset. How could it be more??
What am I doing wrong? I've also tried inner join with no luck. The way I interpret my query is a 4-way venn diagram, with very few number of rows with overlapping criteria, and further optimized by an index on the daterange.
I at least get the resultset that I want if I add 'distinct(table2.ID2)', but why am I otherwise getting a resultset much longer than what I'd expect, and why is it taking so long?
Sorry if any part of my question has been ambiguous, I'd be happy to clarify as needed.
Thanks,
Brian
EDIT:
All indexes refer to a BIGINT column, as I expect my database to get rather large and need quite a number of unique row identifiers... perhaps bigint is overkill and reducing the size of that column and/or the index would speed things up further.
Here's my final solution, based on the accepted answer below:
select ID2 from table2
where exists
(select 1 from table1
where table1.Date between "2012-03-11" and "2012-03-18" and table2.ID1 = table1.ID1
)
and exists
(select 1 from table3
where table3.ID2 = table2.ID2 and exists
(select 1 from table4 where table4.ID3 = table3.ID3)
)
Additionally, I realized I was missing a multi-field index, associating table2.ID1 and table2.ID2... After adding this index, this statement returns in about 11 seconds, and returns approx 20,000 rows.
I think this is reasonable considering the number of rows in each of my tables
table1: ~480,000
table2: ~480,000
table3: ~6,000,000
table4: ~60,000,000
Does this sound efficient? I'll accept the answer after I get confirmation this is the best performance I should expect. I'm running on a Xeon 3GHz system with 3gb mem, ubuntu 12.04, mysql 5.5.24
In all likelihood, your tables have multiple matches between them. Say table1 matches 5 rows in table2 and 10 rows in table3. Then you end up with 50 rows in the output.
To solve this, you need to limit your joins to one row per table.
One way is to use the in clause. If you are using the joins only for filtering, then you can use a where clause instead:
where table2.id1 in (select table1.id1 from table1)
The "in" prevents duplicates.
The other alternative is to pre-aggregate the tables in the joins, so that each joined subquery returns at most one row per key.
MySQL seems to prefer a slightly different construct for the where clause, from an optimization perspective:
where exists (select 1 from table1 where table1.id1 = table2.id1)
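The row multiplication is easy to reproduce on a tiny data set. The sketch below uses Python's built-in sqlite3 module instead of MySQL, with three made-up tables that follow the naming in the question; the row values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id1 INTEGER PRIMARY KEY);
    CREATE TABLE table2 (id2 INTEGER PRIMARY KEY, id1 INTEGER);
    CREATE TABLE table3 (id3 INTEGER PRIMARY KEY, id2 INTEGER);
    INSERT INTO table1 VALUES (1);
    INSERT INTO table2 VALUES (10, 1), (11, 1);
    INSERT INTO table3 VALUES (100, 10), (101, 10), (102, 10), (103, 11);
""")

# Joins: every matching table3 row repeats the table2 row it belongs to.
joined = conn.execute("""
    SELECT table2.id2
    FROM table1
    JOIN table2 ON table2.id1 = table1.id1
    JOIN table3 ON table3.id2 = table2.id2
    ORDER BY table2.id2
""").fetchall()

# EXISTS: the other tables only filter, so each table2 row appears once.
filtered = conn.execute("""
    SELECT table2.id2
    FROM table2
    WHERE EXISTS (SELECT 1 FROM table1 WHERE table1.id1 = table2.id1)
      AND EXISTS (SELECT 1 FROM table3 WHERE table3.id2 = table2.id2)
    ORDER BY table2.id2
""").fetchall()

print(len(joined), joined)
print(len(filtered), filtered)
```

Here the joined version returns four rows for two distinct ID2 values, while the EXISTS version returns exactly two.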

WHERE vs HAVING

Why do you need to place columns you create yourself (for example select 1 as "number") after HAVING and not WHERE in MySQL?
And are there any downsides instead of doing WHERE 1 (writing the whole definition instead of a column name)?
All other answers on this question didn't hit upon the key point.
Assume we have a table:
CREATE TABLE `table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`value` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `value` (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
And have 10 rows with both id and value from 1 to 10:
INSERT INTO `table`(`id`, `value`) VALUES (1, 1),(2, 2),(3, 3),(4, 4),(5, 5),(6, 6),(7, 7),(8, 8),(9, 9),(10, 10);
Try the following 2 queries:
SELECT `value` v FROM `table` WHERE `value`>5; -- Get 5 rows
SELECT `value` v FROM `table` HAVING `value`>5; -- Get 5 rows
You will get exactly the same results, which shows that the HAVING clause can work without a GROUP BY clause.
Here's the difference:
SELECT `value` v FROM `table` WHERE `v`>5;
The above query will raise error: Error #1054 - Unknown column 'v' in 'where clause'
SELECT `value` v FROM `table` HAVING `v`>5; -- Get 5 rows
The WHERE clause allows a condition to use any table column, but it cannot use aliases or aggregate functions.
The HAVING clause allows a condition to use a selected (!) column, an alias or an aggregate function.
This is because the WHERE clause filters data before select, but the HAVING clause filters the resulting data after select.
So putting the conditions in the WHERE clause will be more efficient if you have many rows in a table.
Try EXPLAIN to see the key difference:
EXPLAIN SELECT `value` v FROM `table` WHERE `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | table | range | value | value | 4 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
EXPLAIN SELECT `value` v FROM `table` having `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
| 1 | SIMPLE | table | index | NULL | value | 4 | NULL | 10 | Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
You can see that both WHERE and HAVING use the index, but the rows differ: WHERE does a range scan over 5 rows, while HAVING scans all 10 index entries.
Why is it that you need to place columns you create yourself (for example "select 1 as number") after HAVING and not WHERE in MySQL?
WHERE is applied before GROUP BY, HAVING is applied after (and can filter on aggregates).
In general, you can reference aliases in neither of these clauses, but MySQL allows referencing SELECT level aliases in GROUP BY, ORDER BY and HAVING.
And are there any downsides instead of doing "WHERE 1" (writing the whole definition instead of a column name)
If your calculated expression does not contain any aggregates, putting it into the WHERE clause will most probably be more efficient.
The main difference is that WHERE cannot be used on grouped item (such as SUM(number)) whereas HAVING can.
The reason is the WHERE is done before the grouping and HAVING is done after the grouping is done.
HAVING is used to filter on aggregations in your GROUP BY.
For example, to check for duplicate names:
SELECT Name FROM Usernames
GROUP BY Name
HAVING COUNT(*) > 1
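The ordering of WHERE and HAVING around the grouping can be seen on a handful of rows. This is a minimal sketch using Python's built-in sqlite3 module rather than MySQL; the Active column and the sample names are assumptions added purely for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Usernames (Name TEXT, Active INTEGER);
    INSERT INTO Usernames VALUES
        ('alice', 1), ('alice', 0), ('bob', 1), ('bob', 1), ('carol', 1);
""")

# HAVING filters the groups after they have been counted.
duplicates = conn.execute("""
    SELECT Name FROM Usernames
    GROUP BY Name
    HAVING COUNT(*) > 1
    ORDER BY Name
""").fetchall()

# WHERE removes rows before grouping, so alice is now counted only once.
active_duplicates = conn.execute("""
    SELECT Name FROM Usernames
    WHERE Active = 1
    GROUP BY Name
    HAVING COUNT(*) > 1
    ORDER BY Name
""").fetchall()

print(duplicates)
print(active_duplicates)
```

Adding the WHERE condition changes which groups pass the HAVING test, because the inactive alice row never reaches the COUNT.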
These two feel the same at first, as both state a condition for filtering data. Though we can use ‘having’ in place of ‘where’ in any case, there are instances when we can’t use ‘where’ instead of ‘having’. This is because, in a select query, ‘where’ filters data before ‘select’ while ‘having’ filters data after ‘select’. So, when we use alias names that are not actually in the database, ‘where’ can’t identify them but ‘having’ can.
Ex: let the table Student contain student_id, name, birthday, address. Assume birthday is of type date.
SELECT * FROM Student WHERE YEAR(birthday)>1993; /* this works because birthday is in the database; it also works if we use ‘having’ in place of ‘where’ */
SELECT student_id,(YEAR(CurDate())-YEAR(birthday)) AS Age FROM Student HAVING Age>20;
/* this will not work if we use ‘where’ here: ‘where’ doesn’t know about Age, as Age is defined in the select part */
WHERE filters before data is grouped, and HAVING filters after data is grouped. This is an important distinction; rows that are
eliminated by a WHERE clause will not be included in the group. This
could change the calculated values which, in turn(=as a result) could affect which
groups are filtered based on the use of those values in the HAVING
clause.
And continues,
HAVING is so similar to WHERE that most DBMSs treat them as the same
thing if no GROUP BY is specified. Nevertheless, you should make that
distinction yourself. Use HAVING only in conjunction with GROUP BY
clauses. Use WHERE for standard row-level filtering.
Excerpt From:
Forta, Ben. “Sams Teach Yourself SQL in 10 Minutes (5th
Edition) (Sams Teach Yourself...).”.
HAVING is used only with aggregation, while WHERE is used with non-aggregated conditions.
If you have a WHERE clause, put it before the aggregation (GROUP BY).