I'd like to do a left join using only certain rows of the first table in mysql. Currently I do something like:
SELECT students.* FROM students
LEFT JOIN courses
ON students.id = courses.id
WHERE students.id = 6
But will mysql first select rows from table1 (students) satisfying students.id = 6, before doing the left join?
If not, is there a way to force mysql do to so?
Thanks.
Yes there is, try this:
SELECT students.* FROM students
LEFT JOIN courses
ON students.id = courses.id
HAVING students.id = 6
LIMIT 1
It seems to me that you are trying to optimize for the DB. If the query is not slow, and your testing data set is a reasonable proximation of the production DB, then it is not best practice to do so, for many reasons.
With that said, (maybe I am not right, so ignore me)
I think your trying to say, "How can i speed things up, by limiting the number of rows that the DB looks at during the query"
The best way to do this, is ensure you have proper indexes on the tables, on the fields that your going to query. Indexes are extremely fast. If you do not already index the .id fields of both tables, add those indexes to the DB, and that may solve your issue.
Related
I am tying to execute this query but it is taking more than 5 hours, but the data base size is just 20mb. this is my code. Here I am joining 11 tables with reg_id. I need all columns with distinct values. Please guide me how to rearrange the query.
SELECT *
FROM degree
JOIN diploma
ON degree.reg_id = diploma.reg_id
JOIN further_studies
ON diploma.reg_id = further_studies.reg_id
JOIN iti
ON further_studies.reg_id = iti.reg_id
JOIN personal_info
ON iti.reg_id = personal_info.reg_id
JOIN postgraduation
ON personal_info.reg_id = postgraduation.reg_id
JOIN puc
ON postgraduation.reg_id = puc.reg_id
JOIN skills
ON puc.reg_id = skills.reg_id
JOIN sslc
ON skills.reg_id = sslc.reg_id
JOIN license
ON sslc.reg_id = license.reg_id
JOIN passport
ON license.reg_id = passport.reg_id
GROUP BY fullname
Please help me if I did any mistake
This is a bit long for a comment.
The first problem with your query is that you are using select * with group by fullname. You have zillions of columns in the select that are not in the group by. Unless you really, really, really know what you are doing (which I doubt), this is the wrong way to write a query.
Your performance problem is undoubtedly due to cartesian products and lack of indexes. You are joining across different dimensions -- such as skills and degrees. The result is a product of all the possibilities. For some people, the data size can grow and grow and grow.
And then, the question is: do you have indexes on the keys used in the joins? For performance, you generally want such indexes.
I thought the problem is in the query.First make sure group by fullname and try to give some column names instead of *.
I did not write this query. I am working on someone else's old code. I am looking into changing what is needed for this query but if I could simply speed up this query that would solve my problem temporarily. I am looking at adding indexes. when I did a show indexes there are so many indexes on the table orders can that also slow down a query?
I am no database expert. I guess I will learn more from this effort. :)
SELECT
orders.ORD_ID,
orders.ORD_TotalAmt,
orders.PAYMETH_ID,
orders.SCHOOL_ID,
orders.ORD_AddedOn,
orders.AMAZON_PurchaseDate,
orders.ORDSTATUS_ID,
orders.ORD_InvoiceNumber,
orders.ORD_CustFirstName,
orders.ORD_CustLastName,
orders.AMAZON_ORD_ID,
orders.ORD_TrackingNumber,
orders.ORD_SHIPPINGCNTRY_ID,
orders.AMAZON_IsExpedited,
orders.ORD_ShippingStreet1,
orders.ORD_ShippingStreet2,
orders.ORD_ShippingCity,
orders.ORD_ShippingStateProv,
orders.ORD_ShippingZipPostalCode,
orders.CUST_ID,
orders.ORD_ShippingName,
orders.AMAZON_ShipOption,
orders.ORD_ShipLabelGenOn,
orders.ORD_SHIPLABELGEN,
orders.ORD_AddressVerified,
orders.ORD_IsResidential,
orderstatuses.ORDSTATUS_Name,
paymentmethods.PAYMETH_Name,
shippingoptions.SHIPOPT_Name,
SUM(orderitems.ORDITEM_Qty) AS ORD_ItemCnt,
SUM(orderitems.ORDITEM_Weight * orderitems.ORDITEM_Qty) AS ORD_ItemTotalWeight
FROM
orders
LEFT JOIN orderstatuses ON
orders.ORDSTATUS_ID = orderstatuses.ORDSTATUS_ID
LEFT JOIN orderitems ON
orders.ORD_ID = orderitems.ORD_ID
LEFT JOIN paymentmethods ON
orders.PAYMETH_ID = paymentmethods.PAYMETH_ID
LEFT JOIN shippingoptions ON
orders.SHIPOPT_ID = shippingoptions.SHIPOPT_ID
WHERE
(orders.AMAZON_ORD_ID IS NOT NULL AND (orders.ORD_SHIPLABELGEN IS NULL OR orders.ORD_SHIPLABELGEN = '') AND orderstatuses.ORDSTATUS_ID <> 101 AND orderstatuses.ORDSTATUS_ID <> 40)
GROUP BY
orders.ORD_ID,
orders.ORD_TotalAmt,
orders.PAYMETH_ID,
orders.SCHOOL_ID,
orders.ORD_AddedOn,
orders.ORDSTATUS_ID,
orders.ORD_InvoiceNumber,
orders.ORD_CustFirstName,
orders.ORD_CustLastName,
orderstatuses.ORDSTATUS_Name,
paymentmethods.PAYMETH_Name,
shippingoptions.SHIPOPT_Name
ORDER BY
orders.ORD_ID
One simple thing you should consider is whether you really need to use left joins or you would be satisfied using inner joins for some of the joins. the new query would not be the same as the original query, so you would need to think carefully about what you really want back. If your foreign key relationships are indexed correctly, this could help substantially, especially between ORDERS and ORDERITEMS, because I would imagine these are your largest tables. The following post has a good explanation: INNER JOIN vs LEFT JOIN performance in SQL Server. There are lots of other things that can be done, but you will need to post the query plan so people can dive deeper.
It looks like just adding the index was all that was needed.
create index orderitems_ORD_ID_index on orderitems(ORD_ID);
I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints` AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
Having 250M records, I would shard the gtfsstop_times table on one column. Then each sharded table can be joined in a separate query that can run parallel in separate threads, you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this;
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There other valuable answers to your question and mine is an addition to it. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their table.
If you can validate and make sure that agency ids in the WHERE clause are valid, you can eliminate joining `vehicledata.gtfsagencys` in the JOINS as you are not fetching any records from the table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);
I have a real mindbender of a MySQL problem which I am now thinking there is no answer to. Please help me, you are my only hope!
Stripping it down to the basics, I have two tables, "People" and "Activity". It is possible (long story and lots of data involved) for these two tables to be joined by two different relationship tables: people_activity and entity_activity
I need to do a query on the activity table which gets the people record/s linked to activity records based on both relationship tables.
This is what I have, but it is massively slow on lots of data:
select * from activity
left join peopleactivity on peopleactivity.activityid = activity.activityid
left join entityactivity on entityactivity.activityid = activity.activityid
left join people on (peopleactivity.peopleid = people.peopleid OR
entityactivity.entityid = people.peopleid)
Some more notes - I have also tried creating a view to combine the results of the two relationship tables and instead joining people and activity via this view. This also works, but is also still massively slow
Changing how the relationship/s work to consolodate to one table is a major headache
I have also tried a union -like this -
select * from activity
left join peopleactivity on peopleactivity.activityid = activity.activityid
left join people on (peopleactivity.peopleid = people.peopleid)
union
select * from activity
left join peopleactivity on peopleactivity.activityid = activity.activityid
left join people on (entityactivity.entityid= people.peopleid)
which also works, but for other reasons causes me problems. I really need to do this in one query without changing too much underlying.
Has anyone got any super amazing ideas that I have missed??!
You may try to replace OR with IN
left join people on people.peopleid IN (peopleactivity.peopleid, entityactivity.entityid)
1.) Try setting the id of the tables as the primary key on each table
2.) Use inner joins instead of left joins. Not sure why you are using left joins here as you will get all the results of the other tables left joined on the activity table and get basically all records whether or not they have a join value in another table. I think this might also help you. Can you post a describe of your tables.
I think you should keep the UNION query but making those INNER joins. Do you really need LEFT joins?
You could also change it into UNION ALL, which will have some performance gain:
SELECT activity.*, people.*, 'PA' AS joining_table
FROM activity
JOIN peopleactivity ON peopleactivity.activityid = activity.activityid
JOIN people ON peopleactivity.peopleid = people.peopleid
UNION ALL
SELECT activity.*, people.*, 'EA'
FROM activity
JOIN entityactivity ON entityactivity.activityid = activity.activityid
JOIN people ON entityactivity.entityid = people.peopleid
Thanks for the comments. I had tried various incarnations of the above. My answer was to set up a new table, copy all the existing links into that table, and then use triggers to add/remove links to that table whenever the links were added removed in the two separate link tables. This works well and also allows me to use indexes on this new table to keep things nice and snappy. Many thanks for those that took the time to post the ideas though!
I have a table structure like the following:
user
id
name
profile_stat
id
name
profile_stat_value
id
name
user_profile
user_id
profile_stat_id
profile_stat_value_id
My question is:
How do I evaluate a query where I want to find all users with profile_stat_id and profile_stat_value_id for many stats?
I've tried doing an inner self join, but that quickly gets crazy when searching for many stats. I've also tried doing a count on the actual user_profile table, and that's much better, but still slow.
Is there some magic I'm missing? I have about 10 million rows in the user_profile table and want the query to take no longer than a few seconds. Is that possible?
Typically databases are able to handle 10 million records in a decent manner. I have mostly used oracle in our professional environment with large amounts of data (about 30-40 million rows also) and even doing join queries on the tables has never taken more than a second or two to run.
On IMPORTANT lessson I realized whenever query performance was bad was to see if the indexes are defined properly on the join fields. E.g. Here having index on profile_stat_id and profile_stat_value_id (user_id I am assuming is the primary key) should have indexes defined. This will definitely give you a good performance increaser if you have not done that.
After defining the indexes do run the query once or twice to give DB a chance to calculate the index tree and query plan before verifying the gain
Superficially, you seem to be asking for this, which includes no self-joins:
SELECT u.name, u.id, s.name, s.id, v.name, v.id
FROM User_Profile AS p
JOIN User AS u ON u.id = p.user_id
JOIN Profile_Stat AS s ON s.id = p.profile_stat_id
JOIN Profile_Stat_Value AS v ON v.id = p.profile_stat_value_id
Any of the joins listed can be changed to a LEFT OUTER JOIN if the corresponding table need not have a matching entry. All this does is join the central User_Profile table with each of the other three tables on the appropriate joining column.
Where do you think you need a self-join?
[I have not included anything to filter on 'the many stats'; it is not at all clear to me what that part of the question means.]