Joining tables using subqueries - mysql

I have some tables:
tableA (cola1, cola2, cola3, cola4, cola5)
tableB (colb1, colb2)
tableC (colc1, colc2, colc3)
tableA.cola2 refers to tableB.colb1
and tableA.cola3 refers to tableC.colc1
I want to retrieve data from all of those tables. I have a query using JOIN like this:
Select tableA.cola1, tableA.cola2, tableA.cola3, tableA.cola4, tableA.cola5, tableB.colb2, tableC.colc2, tableC.colc3
FROM tableA
INNER JOIN tableB ON tableA.cola2 = tableB.colb1
INNER JOIN tableC ON tableA.cola3 = tableC.colc1
WHERE tableA.cola5 = 'something'
So, is it possible to write this query using subqueries instead of JOIN?
And which would be better: subqueries or JOIN?
A friend of mine told me that when you have large tables, JOIN is slow and requires a powerful computer, while subqueries are faster and don't require a powerful computer to perform the selection. He said it's because subqueries return results based on something like addition, and JOINs return results based on something like multiplication (I'm not good at English so I don't know how to put this, but I hope you get the idea). I am new to this and I've tried to google it but still can't understand it. Would anybody please spare some time to answer my question and explain this subquery vs JOIN thing to me? Thank you very much.

In most cases JOINs are faster than sub-queries, and it is very rare for a sub-query to be faster.
With JOINs, the RDBMS can create an execution plan that suits the whole query: it can predict what data needs to be loaded and processed, which saves time. A sub-query, especially a correlated one, may instead be run separately (possibly once per outer row) and load all of its data before it can be processed. How much this matters depends on the optimizer and the MySQL version.

Do it this way:
Select cola1, cola2, cola3, cola4, cola5, colb2, colc2, colc3 FROM
(Select cola1, cola2, cola3, cola4, cola5, colb2 FROM
(Select cola1, cola2, cola3, cola4, cola5 FROM tableA) as A
INNER JOIN
(Select colb1, colb2 from tableB) as B
ON A.cola2 = B.colb1)as AB
INNER JOIN
(Select colc1, colc2, colc3 From tableC) as C
ON AB.cola3 = C.colc1
WHERE AB.cola5 = 'something'
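
To answer the "is it possible" part directly, here is a minimal sketch of the same query written with correlated subqueries instead of JOIN. It runs against SQLite through Python's sqlite3 module rather than MySQL, the sample rows are made up, and it assumes tableB.colb1 and tableC.colc1 are unique (otherwise the scalar subqueries would not match the JOIN row for row).

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableB (colb1 INTEGER PRIMARY KEY, colb2 TEXT);
CREATE TABLE tableC (colc1 INTEGER PRIMARY KEY, colc2 TEXT, colc3 TEXT);
CREATE TABLE tableA (cola1 INTEGER PRIMARY KEY, cola2 INTEGER, cola3 INTEGER,
                     cola4 TEXT, cola5 TEXT);
INSERT INTO tableB VALUES (1, 'b-one'), (2, 'b-two');
INSERT INTO tableC VALUES (10, 'c-ten', 'x'), (20, 'c-twenty', 'y');
INSERT INTO tableA VALUES
    (100, 1, 10, 'a4', 'something'),
    (101, 2, 20, 'a4', 'something'),
    (102, 1, 20, 'a4', 'other');
""")

# The JOIN version from the question (with aliases).
join_sql = """
SELECT a.cola1, a.cola2, a.cola3, a.cola4, a.cola5, b.colb2, c.colc2, c.colc3
FROM tableA a
INNER JOIN tableB b ON a.cola2 = b.colb1
INNER JOIN tableC c ON a.cola3 = c.colc1
WHERE a.cola5 = 'something'
ORDER BY a.cola1
"""

# The same result using scalar subqueries for the looked-up columns and
# IN (...) filters to mimic the INNER JOIN's "must match" behaviour.
subquery_sql = """
SELECT a.cola1, a.cola2, a.cola3, a.cola4, a.cola5,
       (SELECT b.colb2 FROM tableB b WHERE b.colb1 = a.cola2) AS colb2,
       (SELECT c.colc2 FROM tableC c WHERE c.colc1 = a.cola3) AS colc2,
       (SELECT c.colc3 FROM tableC c WHERE c.colc1 = a.cola3) AS colc3
FROM tableA a
WHERE a.cola5 = 'something'
  AND a.cola2 IN (SELECT colb1 FROM tableB)
  AND a.cola3 IN (SELECT colc1 FROM tableC)
ORDER BY a.cola1
"""

join_rows = conn.execute(join_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()
print(join_rows == subquery_rows)
```

The subquery form needs one lookup per selected column, which is part of why the JOIN form is usually the easier one to read and to optimize.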

Related

GET * Datas from first table connect over same ID

I guess this kind of question has been asked a thousand times before, but I don't understand the answers to the other questions. I hope someone can explain it to me using my simple code.
I have two tables
TableA -> ID|SITEID|NEXT|...
TABLEB -> ID|SITEID|ANOTHER|...
Now I want to fetch all results which are matched by the same SITEID='SITEXY' and TABLEB.ANOTHER='IDXY'. As a result I only want to receive the fields of TABLEA.
At the moment I do it this way, but I get the fields from both tables.
SELECT * FROM TABLEA, TABLEB WHERE TABLEA.SITEID='SITEXY' AND TABLEB.ANOTHER='IDXY' AND TABLEA.SITEID=TABLEB.SITEID;
Maybe it's better to use "USING" or "JOIN", but I'm too stupid to understand how it works....:-(
You can qualify the wildcard with the table from which you want to get the columns:
select TABLEA.*
from TABLEA
join TABLEB on TABLEA.SITEID = TABLEB.SITEID
where TABLEA.SITEID = 'SITEXY'
and TABLEB.ANOTHER = 'IDXY';
Also, always use modern explicit join syntax instead of a comma-based join.
Using aliases, you can make the query a bit cleaner:
select a.*
from TABLEA a
join TABLEB b on a.SITEID = b.SITEID
where a.SITEID = 'SITEXY'
and b.ANOTHER = 'IDXY';
Assign aliases to both tables, and then select all columns from TABLEA:
SELECT a.*
FROM TABLEA a
INNER JOIN TABLEB b
ON a.SITEID = b.SITEID
WHERE a.SITEID = 'SITEXY' AND
b.ANOTHER = 'IDXY';
Aliases make it easier to read and write a query. Note that I have also replaced your implicit join with an explicit one using INNER JOIN and ON. As a general rule, you should avoid writing commas in the FROM clause.
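
As a quick check of the difference between * and a.*, here is a small sketch that compares the column lists the two forms return. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLEA (ID INTEGER, SITEID TEXT, "NEXT" TEXT);
CREATE TABLE TABLEB (ID INTEGER, SITEID TEXT, ANOTHER TEXT);
INSERT INTO TABLEA VALUES (1, 'SITEXY', 'n1'), (2, 'OTHER', 'n2');
INSERT INTO TABLEB VALUES (7, 'SITEXY', 'IDXY'), (8, 'SITEXY', 'IDZZ');
""")

# Shared FROM/WHERE part: explicit join plus the two filters from the question.
tail = """
FROM TABLEA a
INNER JOIN TABLEB b ON a.SITEID = b.SITEID
WHERE a.SITEID = 'SITEXY' AND b.ANOTHER = 'IDXY'
"""

def run(select_list):
    """Return (column names, rows) for the given select list."""
    cur = conn.execute("SELECT " + select_list + tail)
    names = [d[0] for d in cur.description]
    return names, cur.fetchall()

star_cols, star_rows = run("*")    # columns of both tables
a_cols, a_rows = run("a.*")        # columns of TABLEA only
print(star_cols)
print(a_cols)
```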

SQL query for large amount of data with many joins

I have written a SQL query for my requirement.
It works fine for me and takes 0.0006 sec to execute.
I want to know from SQL experts: will this work fine with a large amount of data?
My query is below.
SELECT HM_customers.id,
HM_customers.username,
HM_customers.firstname,
HM_customers.lastname,
HM_customers.company,
HM_customers_address_bank.field_data
FROM HM_orders
JOIN HM_order_items
ON HM_order_items.order_id = HM_orders.id
JOIN HM_bid
ON HM_order_items.bid_id = HM_bid.bid_id
JOIN HM_customers
ON HM_bid.user_id = HM_customers.id
JOIN HM_customers_address_bank
ON HM_customers_address_bank.id = HM_customers.default_billing_address
WHERE HM_orders.id = '4'
Can any expert advise me on how I can improve this query? Please tell me if there is any issue with it.
NOTE: This is a simple query, but I want to know whether it will still run quickly with a large amount of data.
You don't need to include the orders table:
SELECT c.id,
c.username,
c.firstname,
c.lastname,
c.company,
cb.field_data
FROM HM_order_items oi
JOIN HM_bid b
ON oi.bid_id = b.bid_id
JOIN HM_customers c
ON b.user_id = c.id
JOIN HM_customers_address_bank cb
ON cb.id = c.default_billing_address
WHERE oi.order_id = '4';
Your query can also result in duplicate rows, if a customer bids on the same items multiple times. If you put in a select distinct, then you will incur overhead of duplicate elimination. If this becomes a problem, you will probably want to restructure the query as an exists.
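
A minimal sketch of the duplicate problem and the EXISTS rewrite mentioned above. It runs on SQLite through Python's sqlite3 module rather than MySQL, the tables are cut down to the columns the query touches, and the sample rows (one customer with two bids on order 4) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HM_customers (id INTEGER PRIMARY KEY, username TEXT,
                           default_billing_address INTEGER);
CREATE TABLE HM_customers_address_bank (id INTEGER PRIMARY KEY, field_data TEXT);
CREATE TABLE HM_bid (bid_id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE HM_order_items (order_id INTEGER, bid_id INTEGER);
INSERT INTO HM_customers VALUES (1, 'alice', 50), (2, 'bob', 51);
INSERT INTO HM_customers_address_bank VALUES (50, 'addr-a'), (51, 'addr-b');
INSERT INTO HM_bid VALUES (11, 1), (12, 1), (13, 2);
INSERT INTO HM_order_items VALUES (4, 11), (4, 12), (5, 13);
""")

# Join form: one output row per matching order item, so alice appears twice.
join_sql = """
SELECT c.id, c.username, cb.field_data
FROM HM_order_items oi
JOIN HM_bid b ON oi.bid_id = b.bid_id
JOIN HM_customers c ON b.user_id = c.id
JOIN HM_customers_address_bank cb ON cb.id = c.default_billing_address
WHERE oi.order_id = 4
"""

# EXISTS form: one output row per customer who has at least one such item.
exists_sql = """
SELECT c.id, c.username, cb.field_data
FROM HM_customers c
JOIN HM_customers_address_bank cb ON cb.id = c.default_billing_address
WHERE EXISTS (
    SELECT 1
    FROM HM_bid b
    JOIN HM_order_items oi ON oi.bid_id = b.bid_id
    WHERE b.user_id = c.id AND oi.order_id = 4
)
"""

join_rows = conn.execute(join_sql).fetchall()
exists_rows = conn.execute(exists_sql).fetchall()
print(len(join_rows), len(exists_rows))
```

The EXISTS form avoids the duplicate elimination step that SELECT DISTINCT would add, because it never produces the duplicates in the first place.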
There are a few points worth noting:
1) The reference to an outer table column in the WHERE clause prevents the OUTER JOIN from returning any non-matched rows, which implicitly converts the query to an INNER JOIN. This is probably a bug in the query or a misunderstanding of how OUTER JOIN works.
2) Selecting all columns with the * wildcard will cause the query's meaning and behavior to change if the table's schema changes, and might cause the query to retrieve too much data. You should only choose columns you need.
Make HM_customers your driving table, as your data is coming from this table, and change your joins like this. Hopefully this will help you :)
SELECT hmCust.id,
hmCust.username,
hmCust.firstname,
hmCust.lastname,
hmCust.company,
hmCustAdd.field_data
FROM HM_customers hmCust
INNER JOIN HM_bid hmBid
ON hmBid.user_id = hmCust.id
INNER JOIN HM_customers_address_bank hmCustAdd
ON hmCustAdd.id = hmCust.default_billing_address
INNER JOIN HM_order_items hmOrderItem
ON hmOrderItem.bid_id = hmBid.bid_id
INNER JOIN HM_orders hmOrder
ON hmOrder.id = hmOrderItem.order_id
WHERE hmOrder.id = '4'

SELECT DISTINCT statement in MySQL is taking 10 minutes

I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22 GB of memory, I've set the InnoDB buffer pool to 8 GB, and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
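
To sanity-check that moving DISTINCT into the subquery does not change the result, here is a small sketch on made-up data. It uses SQLite through Python's sqlite3 module instead of MySQL, drops the schema prefixes, and assumes atcoCode is unique in stoppoints (as asked above). It says nothing about speed, only about equivalence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stoppoints (atcoCode TEXT PRIMARY KEY, name TEXT,
                         longitude REAL, latitude REAL);
CREATE TABLE gtfsagencys (agency_id INTEGER PRIMARY KEY);
CREATE TABLE gtfsroutes (route_id INTEGER PRIMARY KEY, agency_id INTEGER);
CREATE TABLE gtfstrips (trip_id INTEGER PRIMARY KEY, route_id INTEGER);
CREATE TABLE gtfsstop_times (trip_id INTEGER, fk_atco_code TEXT);
INSERT INTO stoppoints VALUES ('S1', 'First', 0.1, 51.0),
                              ('S2', 'Second', 0.2, 51.1),
                              ('S3', 'Third', 0.3, 51.2);
INSERT INTO gtfsagencys VALUES (1), (9);
INSERT INTO gtfsroutes VALUES (1, 1), (2, 9);
INSERT INTO gtfstrips VALUES (100, 1), (101, 1), (200, 2);
INSERT INTO gtfsstop_times VALUES (100, 'S1'), (101, 'S1'), (100, 'S2'), (200, 'S3');
""")

# Original shape: DISTINCT applied to the full joined rows.
original_sql = """
SELECT DISTINCT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM stoppoints AS sp
INNER JOIN gtfsstop_times AS st ON sp.atcoCode = st.fk_atco_code
INNER JOIN gtfstrips AS trip ON st.trip_id = trip.trip_id
INNER JOIN gtfsroutes AS route ON trip.route_id = route.route_id
INNER JOIN gtfsagencys AS agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1, 2, 3, 4)
"""

# Rewritten shape: DISTINCT only over the stop ids, then one lookup per id.
rewritten_sql = """
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM stoppoints AS sp
JOIN (
    SELECT DISTINCT st.fk_atco_code AS atcoCode
    FROM gtfsroutes AS route
    JOIN gtfstrips AS trip ON trip.route_id = route.route_id
    JOIN gtfsstop_times AS st ON trip.trip_id = st.trip_id
    WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
"""

original_rows = sorted(conn.execute(original_sql).fetchall())
rewritten_rows = sorted(conn.execute(rewritten_sql).fetchall())
print(original_rows == rewritten_rows)
```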
With 250M records, I would shard the gtfsstop_times table on one column. Each shard can then be joined in a separate query, and those queries can run in parallel in separate threads; you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There are other valuable answers to your question, and mine is an addition to them. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their tables.
If you can validate that the agency ids in the WHERE clause are valid, you can drop the join to `vehicledata`.gtfsagencys, as you are not fetching any columns from that table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);

Need help speeding up a MySQL query

I need a query that quickly shows the articles within a particular module (a subset of articles) that a user has NOT uploaded a PDF for. The query I am using below takes about 37 seconds, given there are 300,000 articles in the Article table, and 6,000 articles in the Module.
SELECT *
FROM article a
INNER JOIN article_module_map amm ON amm.article=a.id
WHERE amm.module = 2 AND
a.id NOT IN (
SELECT afm.article
FROM article_file_map afm
INNER JOIN article_module_map amm ON amm.article = afm.article
WHERE afm.organization = 4 AND
amm.module = 2
)
What I am doing in the above query is first truncating the list of articles to the selected module, and then further truncating that list to the articles that are not in the subquery. The subquery is generating a list of the articles that an organization has already uploaded PDF's for. Hence, the end result is a list of articles that an organization has not yet uploaded PDF's for.
Help would be hugely appreciated, thanks in advance!
EDIT 2012/10/25
With @fthiella's help, the below query ran in an astonishing 1.02 seconds, down from 37+ seconds!
SELECT a.* FROM (
SELECT article.* FROM article
INNER JOIN article_module_map
ON article.id = article_module_map.article
WHERE article_module_map.module = 2
) AS a
LEFT JOIN article_file_map
ON a.id = article_file_map.article
AND article_file_map.organization=4
WHERE article_file_map.id IS NULL
I am not sure that I understand the logic and the structure of the tables correctly. This is my query:
SELECT
article.id
FROM
article
INNER JOIN
article_module_map
ON article.id = article_module_map.article
AND article_module_map.module=2
LEFT JOIN
article_file_map
ON article.id = article_file_map.article
AND article_file_map.organization=4
WHERE
article_file_map.id IS NULL
I extract all of the articles that are in module 2. I then select those for which organization 4 didn't provide a file.
I used a LEFT JOIN instead of a subquery. In some circumstances this could be faster.
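
The rewrite relies on NOT IN, LEFT JOIN ... IS NULL, and NOT EXISTS all expressing the same anti-join. Here is a minimal sketch that checks this on invented data, using SQLite through Python's sqlite3 module rather than MySQL; the table and column names follow the question, the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE article_module_map (article INTEGER, module INTEGER);
CREATE TABLE article_file_map (id INTEGER PRIMARY KEY, article INTEGER,
                               organization INTEGER);
INSERT INTO article VALUES (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four');
INSERT INTO article_module_map VALUES (1, 2), (2, 2), (3, 2), (4, 7);
-- Organization 4 uploaded a file for article 1 only.
INSERT INTO article_file_map VALUES (1, 1, 4), (2, 2, 5);
""")

not_in_sql = """
SELECT a.id
FROM article a
INNER JOIN article_module_map amm ON amm.article = a.id
WHERE amm.module = 2
  AND a.id NOT IN (SELECT afm.article
                   FROM article_file_map afm
                   WHERE afm.organization = 4)
ORDER BY a.id
"""

left_join_sql = """
SELECT a.id
FROM article a
INNER JOIN article_module_map amm ON amm.article = a.id AND amm.module = 2
LEFT JOIN article_file_map afm
       ON afm.article = a.id AND afm.organization = 4
WHERE afm.id IS NULL
ORDER BY a.id
"""

not_exists_sql = """
SELECT a.id
FROM article a
INNER JOIN article_module_map amm ON amm.article = a.id
WHERE amm.module = 2
  AND NOT EXISTS (SELECT 1
                  FROM article_file_map afm
                  WHERE afm.article = a.id AND afm.organization = 4)
ORDER BY a.id
"""

results = {
    name: conn.execute(sql).fetchall()
    for name, sql in [("not_in", not_in_sql),
                      ("left_join", left_join_sql),
                      ("not_exists", not_exists_sql)]
}
print(results)
```

One caveat: the three forms only agree while article_file_map.article contains no NULLs. If the NOT IN subquery returns a NULL, the NOT IN form returns no rows at all, while the other two are unaffected.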
EDIT Thank you for your comment. I wasn't sure it would run faster, but it surprises me that it is so much slower! Anyway, it was worth a try!
Now, out of curiosity, I would like to try all the combinations of LEFT/INNER JOIN and subquery, to see which one runs faster, eg:
SELECT *
FROM
(SELECT *
FROM
article INNER JOIN article_module_map
ON article.id = article_module_map.article
WHERE
article_module_map.module=2) AS a
LEFT JOIN
etc.
maybe removing *, and I would like to see what changes between the condition on the WHERE clause and on the ON clause... anyway I think it doesn't help much, you should concentrate on indexes now.
Indexes on keys/foreign key should be okay already, but what if you add an index on article_module_map.module and/or article_file_map.organization ?
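
A sketch of what those indexes could look like. The index names and the choice of composite (two-column) indexes are my assumptions, not something taken from the question; the CREATE INDEX statements are run on SQLite through Python's sqlite3 module, but the same statement form is valid in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article_module_map (article INTEGER, module INTEGER);
CREATE TABLE article_file_map (id INTEGER PRIMARY KEY, article INTEGER,
                               organization INTEGER);

-- Hypothetical index names. The filter column comes first, then the join
-- column, so "module = 2" / "organization = 4" can narrow the rows before
-- the join on article.
CREATE INDEX idx_amm_module_article ON article_module_map (module, article);
CREATE INDEX idx_afm_org_article ON article_file_map (organization, article);
""")

def index_names(table):
    """List the index names SQLite reports for a table."""
    return sorted(row[1] for row in conn.execute(f"PRAGMA index_list({table})"))

amm_indexes = index_names("article_module_map")
afm_indexes = index_names("article_file_map")
print(amm_indexes, afm_indexes)
```

Whether these help in practice is something to confirm with EXPLAIN on the real tables.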
When optimizing queries I usually check the following points:
First: I would avoid using * in the SELECT clause; instead, name the specific fields you want. This can increase the speed a lot (I had one query which took 7 seconds with *, and naming the fields brought it down to 0.1s).
Second: As @Adder says, add indexes to your tables.
Third: Try using INNER JOIN instead of WHERE amm.module = 2 AND a.id NOT IN ( ... ). I think I read (I don't remember it well, so take it with care) that MySQL usually optimizes INNER JOINs well, and as your subquery is a filter, maybe using three INNER JOINs plus a WHERE would be faster.

Do you have to join tables "ON" fields or can you just equate them in the where clause?

We have been doing queries a bunch of different ways and queries have been working when we do a
SELECT t.thing FROM table1 t JOIN table2 s WHERE t.something = s.somethingelse AND t.something = 1
and it worked with all queries except one. This one query was hanging forever and crashing our server, but it apparently works if we do it like:
SELECT t.thing FROM table1 t JOIN table2 s ON t.something = s.somethingelse WHERE t.something = 1
We are trying to figure out if the problem is due to the query structure or due to some corruption in the account we are trying to query.
Is the first syntax correct? Thanks.
You need to use the ON clause. Though you can also join with commas, e.g.: SELECT * FROM table1, table2;
Hope that helps!
There are different ANSI formats.. you can use
Select ...
from tbl1 join tbl2 on tbl1.fld = tbl2.fld
OR
select ...
from tbl1, tbl2
where tbl1.fld = tbl2.fld...
The explicit join is the more common format where you are explicitly showing developers after yourself how the tables are related without respect to filtering criteria.
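
A small sketch comparing the three ways of writing the join discussed in this thread. It runs on SQLite through Python's sqlite3 module; SQLite, like MySQL, accepts JOIN without ON and treats it as a cross join, but other databases may reject that form. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (thing TEXT, something INTEGER);
CREATE TABLE table2 (somethingelse INTEGER);
INSERT INTO table1 VALUES ('a', 1), ('b', 1), ('c', 2);
INSERT INTO table2 VALUES (1), (1), (3);
""")

queries = {
    # Explicit join: relationship in ON, filter in WHERE.
    "join_on": """
        SELECT t.thing FROM table1 t
        JOIN table2 s ON t.something = s.somethingelse
        WHERE t.something = 1 ORDER BY t.thing""",
    # JOIN without ON: cross join, with the relationship moved into WHERE.
    "join_where": """
        SELECT t.thing FROM table1 t
        JOIN table2 s
        WHERE t.something = s.somethingelse AND t.something = 1
        ORDER BY t.thing""",
    # Comma join: same cross join, older syntax.
    "comma_where": """
        SELECT t.thing FROM table1 t, table2 s
        WHERE t.something = s.somethingelse AND t.something = 1
        ORDER BY t.thing""",
}
rows = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

# Without any join condition the result is the full cross product (3 x 3 rows).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM table1 t JOIN table2 s").fetchone()[0]
print(rows, cross_count)
```

All three forms return the same rows here. The risk with the WHERE-based forms is that a forgotten condition silently produces the cross product, which on large tables can look like a query that hangs.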
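Your first syntax is missing the ON. In standard SQL, an inner join must say ON which fields the join is happening; MySQL accepts JOIN without ON and treats it as a cross join that your WHERE clause then filters.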
I would recommend using JOIN ON over WHERE to do your joins.
1) Your WHERE clause will be easier to read since it will not be polluted by join conditions.
2) Your join section is easier to read and understand.
We all agree both methods work, but the JOIN one is better due to these points.
my 2 cents