Query's result set is too big - mysql

I have a query that can be fast or slow depending on how many records I'm fetching. Here's a table showing the number in my LIMIT clause and the corresponding time it takes to execute the query and fetch the results:
LIMIT | Seconds (Duration / Fetch)
------+---------------------------
   10 |  0.030 /   0.0
  100 |  0.062 /   0.0
 1000 |  1.700 /   0.8
10000 | 25.000 / 100.0
As you can see, it's fine up to at least 1,000 but 10,000 is really slow, mostly due to a high fetch time. I don't understand why the growth of the fetch time isn't linear but I am grabbing over 200 columns from over 70 tables, so the fact that the result set takes a long time to fetch is not a surprise.
What I'm fetching, by the way, is data on all the accounts at a certain bank. The bank I'm dealing with has about 160,000 accounts so I ultimately need to fetch 160,000 rows from the database.
It's obviously not going to be feasible to try to fetch 160,000 rows at once (at least not unless I can somehow dramatically optimize my query). It seems to me that the biggest chunk I can reasonably grab is 1,000 rows, so I wrote a script that would run the query over and over with a SELECT INTO OUTFILE, limit and offset. Then, at the end, I take all the CSV files I dumped and cat them together. It works but it's slow. It takes hours. I have the script running right now and it's only dumped 43,000 rows in about an hour.
Should I attack this problem at the query optimization level or does the long fetch time suggest I should focus elsewhere? What would you recommend I do?
If you want to see the query you can see it here.

The answer is going to greatly depend on what you're doing with the data. Querying 215 columns through 29 joins will never be quick for non-trivial record sizes.
If you're trying to display 160,000 records to the user, you should page the results and only fetch one page at a time. This will keep the result set small enough that even a relatively inefficient query will return quickly. In this case, you will also want to examine just how much data the user needs in order to select or manipulate the data. Chances are good that you can pare it down to a handful of fields and some aggregates (count, sum, etc) that will let the user make an informed decision about which records they want to work with. Use LIMIT with an offset to pull single pages of arbitrary size.
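As a rough illustration of paging, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for MySQL. The account table and its columns are hypothetical. Note that it swaps the growing OFFSET for a "remember the last id seen" filter (often called keyset or seek paging), because with a large OFFSET the server still has to produce and discard all the skipped rows; treat it as one possible approach, not the only one.

```python
import sqlite3

def fetch_pages(conn, page_size):
    """Yield lists of (id, balance) rows, one page at a time, ordered by id."""
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, balance FROM account WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the next page starts after the last id seen

# Hypothetical demo data: 25 accounts in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany(
    "INSERT INTO account (id, balance) VALUES (?, ?)",
    [(i, i * 10.0) for i in range(1, 26)],
)
pages = list(fetch_pages(conn, 10))
print([len(p) for p in pages])  # 25 rows in pages of 10 -> [10, 10, 5]
```

Each page costs roughly the same because the WHERE clause can use the primary key to jump straight to the start of the page.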
If you need to export the data for reporting purposes, ensure that you are only pulling the exact data that the report needs. Eliminate joins where possible and use subqueries where you need an aggregate of child data. You'll want to tune or add indexes for the frequently used joins and criteria; in the case of your provided query, that means ib.id and the many foreign keys you're joining through. You can leave off boolean columns because they do not have enough distinct values to form a meaningful index.
Regardless of what you're trying to accomplish, removing some of the joins and columns will inherently speed up your processing. The amount of heavy lifting that MySQL needs to do to fill that query is your main stumbling block.

I've restructured your query in the hope of a significant performance improvement. STRAIGHT_JOIN tells MySQL to join the tables in the order you've stated (or as I've adjusted here). The innermost query, aliased "PreQuery", starts at your criteria on the import bundle and generic import, then goes to the account import and on to the account. By applying the WHERE clause there (and, as you test, adding your LIMIT clause there too), you pre-join these tables and get them out of the way before spending any time on the customer, address and other information. In the query, I've also adjusted the JOIN/LEFT JOIN order to better show the relationships between the underlying linked tables (primarily for anyone else reading along).
As another person noted, the PreQuery could serve as a master list of Account.ID records to page through. I would be curious how this performs compared to your existing query, especially at the 10,000 limit range.
The PREQUERY gets unique elements (including the Account ID used downstream, bank, month, year and category), so those tables don't have to be rejoined in the rest of the joining process.
SELECT STRAIGHT_JOIN
PreQuery.*,
customer.customer_number,
customer.name,
customer.has_bad_address,
address.line1,
address.line2,
address.city,
state.name,
address.zip,
po_box.line1,
po_box.line2,
po_box.city,
po_state.name,
po_box.zip,
customer.date_of_birth,
northway_account.cffna,
northway_account.cfinsc,
customer.deceased,
customer.social_security_number,
customer.has_internet_banking,
customer.safe_deposit_box,
account.has_bill_pay,
account.has_e_statement,
branch.number,
northway_product.code,
macatawa_product.code,
account.account_number,
account.available_line,
view_macatawa_atm_card.number,
view_macatawa_debit_card.number,
uc.code use_class,
account.open_date,
account.balance,
account.affinion,
northway_account.ytdsc,
northway_account.ytdodf,
northway_account.ytdnsf,
northway_account.rtckcy,
northway_account.rtckwy,
northway_account.odwvey,
northway_account.ytdscw,
northway_account.feeytd,
customer.do_not_mail,
northway_account.aledq1,
northway_account.aledq2,
northway_account.aledq3,
northway_account.aledq4,
northway_account.acolq1,
northway_account.acolq2,
northway_account.acolq3,
northway_account.acolq4,
o.officer_number,
northway_account.avg_bal_1,
northway_account.avg_bal_2,
northway_account.avg_bal_3,
account.maturity_date,
account.interest_rate,
northway_account.asslc,
northway_account.paidlc,
northway_account.lnuchg,
northway_account.ytdlc,
northway_account.extfee,
northway_account.penamt,
northway_account.cdytdwaive,
northway_account.cdterm,
northway_account.cdtcod,
account.date_of_last_statement,
northway_account.statement_cycle,
northway_account.cfna1,
northway_account.cfna2,
northway_account.cfna3,
northway_account.cfna4,
northway_account.cfcity,
northway_account.cfstate,
northway_account.cfzip,
northway_account.actype,
northway_account.sccode,
macatawa_account.account_type_code,
macatawa_account.account_type_code_description,
macatawa_account.advance_code,
macatawa_account.amount_last_advance,
macatawa_account.amount_last_payment,
macatawa_account.available_credit,
macatawa_account.balance_last_statement,
macatawa_account.billing_day,
macatawa_account.birthday_3,
macatawa_account.birthday_name_2,
macatawa_account.ceiling_rate,
macatawa_account.class_code,
macatawa_account.classified_doubtful,
macatawa_account.classified_loss,
macatawa_account.classified_special,
macatawa_account.classified_substandard,
macatawa_account.closed_account_flag,
macatawa_account.closing_balance,
macatawa_account.compounding_code,
macatawa_account.cost_center_full,
macatawa_account.cytd_aggregate_balance,
macatawa_account.cytd_amount_of_advances,
macatawa_account.cytd_amount_of_payments,
macatawa_account.cytd_average_balance,
macatawa_account.cytd_average_principal_balance,
macatawa_account.cytd_interest_paid,
macatawa_account.cytd_number_items_nsf,
macatawa_account.cytd_number_of_advanes,
macatawa_account.cytd_number_of_payments,
macatawa_account.cytd_number_times_od,
macatawa_account.cytd_other_charges,
macatawa_account.cytd_other_charges_waived,
macatawa_account.cytd_reporting_points,
macatawa_account.cytd_service_charge,
macatawa_account.cytd_service_charge_waived,
macatawa_account.date_closed,
macatawa_account.date_last_activity,
macatawa_account.date_last_advance,
macatawa_account.date_last_payment,
macatawa_account.date_paid_off,
macatawa_account.ddl_code,
macatawa_account.deposit_rate_index,
macatawa_account.employee_officer_director_full_desc,
macatawa_account.floor_rate,
macatawa_account.handling_code,
macatawa_account.how_paid_code,
macatawa_account.interest_frequency,
macatawa_account.ira_plan,
macatawa_account.load_rate_code,
macatawa_account.loan_rate_code,
macatawa_account.loan_rating_code,
macatawa_account.loan_rating_code_1_full_desc,
macatawa_account.loan_rating_code_2_full_desc,
macatawa_account.loan_rating_code_3_full_desc,
macatawa_account.loan_to_value_ratio,
macatawa_account.maximum_credit,
macatawa_account.miscellaneous_code_full_desc,
macatawa_account.months_to_maturity,
macatawa_account.msa_code,
macatawa_account.mtd_agg_available_balance,
macatawa_account.naics_code,
macatawa_account.name_2,
macatawa_account.name_3,
macatawa_account.name_line,
macatawa_account.name_line_2,
macatawa_account.name_line_3,
macatawa_account.name_line_1,
macatawa_account.net_payoff,
macatawa_account.opened_by_responsibility_code_full,
macatawa_account.original_issue_date,
macatawa_account.original_maturity_date,
macatawa_account.original_note_amount,
macatawa_account.original_note_date,
macatawa_account.original_prepaid_fees,
macatawa_account.participation_placed_code,
macatawa_account.participation_priority_code,
macatawa_account.pay_to_account,
macatawa_account.payment_code,
macatawa_account.payoff_principal_balance,
macatawa_account.percent_participated_code,
macatawa_account.pmtd_number_deposit_type_1,
macatawa_account.pmtd_number_deposit_type_2,
macatawa_account.pmtd_number_deposit_type_3,
macatawa_account.pmtd_number_type_1,
macatawa_account.pmtd_number_type_2,
macatawa_account.pmtd_number_type_6,
macatawa_account.pmtd_number_type_8,
macatawa_account.pmtd_number_type_9,
macatawa_account.principal,
macatawa_account.purpose_code,
macatawa_account.purpose_code_full_desc,
macatawa_account.pytd_number_of_items_nsf,
macatawa_account.pytd_number_of_times_od,
macatawa_account.rate_adjuster,
macatawa_account.rate_over_split,
macatawa_account.rate_under_split,
macatawa_account.renewal_code,
macatawa_account.renewal_date,
macatawa_account.responsibility_code_full,
macatawa_account.secured_unsecured_code,
macatawa_account.short_first_name_1,
macatawa_account.short_first_name_2,
macatawa_account.short_first_name_3,
macatawa_account.short_last_name_1,
macatawa_account.short_last_name_2,
macatawa_account.short_last_name_3,
macatawa_account.statement_cycle,
macatawa_account.statement_rate,
macatawa_account.status_code,
macatawa_account.tax_id_number_name_2,
macatawa_account.tax_id_number_name_3,
macatawa_account.teller_alert_1,
macatawa_account.teller_alert_2,
macatawa_account.teller_alert_3,
macatawa_account.term,
macatawa_account.term_code,
macatawa_account.times_past_due_01_29,
macatawa_account.times_past_due_01_to_29_days,
macatawa_account.times_past_due_30_59,
macatawa_account.times_past_due_30_to_59_days,
macatawa_account.times_past_due_60_89,
macatawa_account.times_past_due_60_to_89_days,
macatawa_account.times_past_due_over_90,
macatawa_account.times_past_due_over_90_days,
macatawa_account.tin_code_name_1,
macatawa_account.tin_code_name,
macatawa_account.tin_code_name_2,
macatawa_account.tin_code_name_3,
macatawa_account.total_amount_past_due,
macatawa_account.waiver_od_charge,
macatawa_account.waiver_od_charge_description,
macatawa_account.waiver_service_charge_code,
macatawa_account.waiver_transfer_advance_fee,
macatawa_account.short_first_name,
macatawa_account.short_last_name
FROM
( SELECT STRAIGHT_JOIN DISTINCT
b.name bank,
ib.YEAR,
ib.MONTH,
ip.category,
Account.ID
FROM import_bundle ib
JOIN generic_import gi ON ib.id = gi.import_bundle_id
JOIN account_import ai ON gi.id = ai.generic_import_id
JOIN Account ON ai.id = Account.account_import_id
JOIN import_profile ip ON gi.import_profile_id = ip.id
JOIN bank b ON ib.Bank_ID = b.id
WHERE
ib.ID = 95
AND ib.Active = 1
AND gi.Active = 1
LIMIT 1000 ) PreQuery
JOIN Account on PreQuery.ID = Account.ID
JOIN Customer on Account.Customer_ID = Customer.ID
JOIN Officer o on Account.Officer_ID = o.ID
LEFT JOIN branch ON Account.branch_id = branch.id
LEFT JOIN cd_type ON account.cd_type_id = cd_type.id
LEFT JOIN use_class uc ON account.use_class_id = uc.id
LEFT JOIN account_type at ON account.account_type_id = at.id
LEFT JOIN northway_account ON account.id = northway_account.account_id
LEFT JOIN macatawa_account ON account.id = macatawa_account.account_id
LEFT JOIN view_macatawa_debit_card ON account.id = view_macatawa_debit_card.account_id
LEFT JOIN view_macatawa_atm_card ON account.id = view_macatawa_atm_card.account_id
LEFT JOIN original_address OA ON Account.ID = OA.account_id
JOIN Account_Address AA ON Account.ID = AA.account_id
JOIN address ON AA.address_id = address.id
JOIN state ON address.state_id = state.id
LEFT JOIN Account_po_box APB ON Account.ID = APB.account_id
LEFT JOIN address po_box ON APB.address_id = po_box.id
LEFT JOIN state po_state ON po_box.state_id = po_state.id
LEFT JOIN Account_macatawa_product amp ON account.id = amp.account_id
LEFT JOIN macatawa_product ON amp.macatawa_product_id = macatawa_product.id
LEFT JOIN product_type pt ON macatawa_product.product_type_id = pt.id
LEFT JOIN harte_hanks_service_category hhsc ON macatawa_product.harte_hanks_service_category_id = hhsc.id
LEFT JOIN core_file_type cft ON macatawa_product.core_file_type_id = cft.id
LEFT JOIN Account_northway_product anp ON account.id = anp.account_id
LEFT JOIN northway_product ON anp.northway_product_id = northway_product.id

The non-linear increase in fetch time is likely the result of key buffers filling up, and probably other memory related issues as well. You should both optimize the query using EXPLAIN to maximize use of indexes, and tune your MySQL server settings.

Related

Mysql Join Query taking a long time to execute

I have a query which is taking a long time to execute.
Table descriptions: these tables are very large, so I will give only the relevant columns. All columns are varchar.
Table 1 - General
PK - CLAIM_ID
No of Records - 2.63 Million
Table 2 - Enrol
No of Records - 2.5 Million
Cols - CLAIM_ID(PK),POLICY_ID,MEMBER_ID
Table 3 - Member
No of Records - 28 Million
Cols - MEMBER_ID(PK),POLICY_GROUP_ID
Table 4 - Policy
No of Records - 2 Million
Cols - POLICY_ID,policy_sub_general_type_id
Table 5 - Balance
No of Records - 12 Million
Columns
Query is
SELECT cg.CLAIM_ID,mem.Policy_group_ID ,
CAST(CASE when pol.policy_sub_general_type_id = 'PFL'
then (bal2.sum_insured - bal2.utilised_sum_insured)
when pol.policy_sub_general_type_id = 'PNF'
then (bal1.sum_insured - bal1.utilised_sum_insured)
end AS DECIMAL(10, 2) ) Balance_SI
FROM General cg
LEFT JOIN Enrol ce ON cg.CLAIM_ID = ce.CLAIM_ID
LEFT JOIN Member mem ON ce.MEMBER_ID = mem.MEMBER_ID
LEFT JOIN Policy pol ON pol.POLICY_ID = ce.POLICY_ID
LEFT join Balance bal1 ON bal1.MEMBER_ID = ce.MEMBER_ID
and bal1.MEMBER_ID is not null
LEFT join Balance bal2 ON bal2.Policy_group_ID = mem.Policy_group_ID
and bal2.Policy_group_ID is not null
GROUP BY cg.CLAIM_ID
Explain Statement shows
select_type | table | type   | key     | rows    | Extra
------------+-------+--------+---------+---------+------------
SIMPLE      | cg    | index  | PRIMARY | 2662233 | Using index
SIMPLE      | ce    | ref    | index1  | 1       | NULL
SIMPLE      | mem   | eq_ref | PRIMARY | 1       | Using where
SIMPLE      | pol   | eq_ref | PRIMARY | 1       | Using where
SIMPLE      | bal1  | ref    | index2  | 3       | Using where
SIMPLE      | bal2  | ref    | index1  | 1       | Using where
Server params
InnoDB_Buffer_pool - 10GB
InnoDB_Log_File_Size - 3GB
4 Core processor
All tables and columns have the same collation and character set, so this is not a collation issue. Also, all join columns are varchar. The EXPLAIN output shows (I assume) that the tables are indexed well.
The query takes around 15 minutes to return the first 50,000 rows, which is unacceptable. For the entire table it has been running for the last 3 hours without any result.
I have no idea why this is happening. Please help.
For starters, you can completely remove the General table (alias "cg") unless you are using it for other columns you are not showing here. The reason is that you have the claim ID directly in your enrollment table, so this just removes an extra level.
Next, your Group by is only on the claim, but the policy group ID is part of your select. Did you intend to have it aggregated per policy too? Can one claim be covered by multiple policy groups? If not and you are just trying to carry that forward, you can keep it via
MAX( mem.Policy_Group_ID ) as Policy_Group_ID
As noted by Strawberry, doing aggregates / group by where you MIGHT have Cartesian results could give you false answers.
I would also suggest editing your post and confirming some additional details, such as the structure of the Balance table. You have one total based on "PFL" and another on "PNF"; you know the specific meaning behind them, but they mean nothing to us. Why is your CASE/WHEN pulling the value from the "bal1" alias in one case and the "bal2" alias in the other? Is this a condition where a specific policy group is NOT entered into the balance table and it falls into either some "generic bucket" or a bucket specific to a single policy? For example, regular coverage of "X", but with a limit on category "Y"?
Below is the SQL reformatted for readability, with the General table removed.
SELECT
ce.CLAIM_ID,
mem.Policy_group_ID,
CAST(CASE when pol.policy_sub_general_type_id = 'PFL'
then (bal2.sum_insured - bal2.utilised_sum_insured)
when pol.policy_sub_general_type_id = 'PNF'
then (bal1.sum_insured - bal1.utilised_sum_insured) end AS DECIMAL(10,2)) Balance_SI
FROM
Enrol ce
LEFT JOIN Member mem
on ce.MEMBER_ID = mem.MEMBER_ID
LEFT join Balance bal2
on mem.Policy_group_ID = bal2.Policy_group_ID
and bal2.Policy_group_ID <> ''
LEFT JOIN Policy pol
on ce.POLICY_ID = pol.POLICY_ID
LEFT join Balance bal1
on ce.MEMBER_ID = bal1.MEMBER_ID
and bal1.MEMBER_ID <> ''
GROUP BY
ce.CLAIM_ID
Finally, looking at your CASE/WHEN and the join on the bal2 alias, you have no reference to a member ID, so let me show you the Cartesian killer you are probably encountering. For example, federal employees fall into one policy group that has 20k employees. Now you have one enrollment record left-joining to the balance table. Is there one balance record per policy group, or one per member per policy group? If it is per member per policy group, you are plowing through 20k balance records every time you try to get a value from bal2, whereas the bal1 alias of the Balance table is explicit per member ID. I know both fields are in the table, and that might be what is killing you.
Again, please edit your existing post for clarification of details and relationships, especially 1:1 vs 1:n
This is not an answer yet
Your DB schema is not clear to me.
I have many questions and a lot of ideas how to speed this query up.
Let's take a look at the first part of your query:
SELECT cg.CLAIM_ID,
mem.Policy_group_ID,
CAST(
CASE
when
pol.policy_sub_general_type_id = 'PFL' then
(bal2.sum_insured - bal2.utilised_sum_insured)
when pol.policy_sub_general_type_id = 'PNF' then
(bal1.sum_insured - bal1.utilised_sum_insured)
END
AS DECIMAL(10,2)
) Balance_SI
You have "inline" expressions that cost some performance: CAST, CASE, bal1.sum_insured - bal1.utilised_sum_insured, and bal2.sum_insured - bal2.utilised_sum_insured.
If your application can accept an unformatted result from the query, I would suggest removing the CAST. It will speed up the query a bit with no effect on the real values returned, and you can round those values later at the application level.
Next is the CASE. Again, if you have an application level (I hope you do), you can return raw data instead of the transformed result. I mean you can return 3 columns instead of the CASE: pol.policy_sub_general_type_id, bal1.sum_insured - bal1.utilised_sum_insured, and bal2.sum_insured - bal2.utilised_sum_insured. But I suspect you don't even need this optimization. I'll show that later.
I have many questions about your JOINs as well, but since you have not replied to DRapp's answer yet, I will hold them for now.
Let's go directly to the query that I suspect will return almost the same data you need, and discuss details later if you have any particular questions.
SELECT
cg.CLAIM_ID,
mem.Policy_group_ID ,
SUM(bal.sum_insured - bal.utilised_sum_insured) Balance_SI
FROM `General` cg
LEFT JOIN Enrol ce
ON cg.CLAIM_ID = ce.CLAIM_ID
LEFT JOIN Member mem
ON ce.MEMBER_ID = mem.MEMBER_ID
LEFT JOIN Policy pol
ON pol.POLICY_ID = ce.POLICY_ID
AND (pol.policy_sub_general_type_id = 'PNF'
OR pol.policy_sub_general_type_id = 'PFL')
LEFT JOIN Balance bal
ON (bal.MEMBER_ID = ce.MEMBER_ID
AND bal.MEMBER_ID <> '')
OR (bal.Policy_group_ID = mem.Policy_group_ID
AND bal.Policy_group_ID <> '')
GROUP BY cg.CLAIM_ID, mem.Policy_group_ID

Sql Join taking a lot of time

I am trying to execute this query, but it takes more than 5 hours even though the database size is just 20 MB. This is my code. Here I am joining 11 tables on reg_id. I need all columns with distinct values. Please guide me on how to rearrange the query.
SELECT *
FROM degree
JOIN diploma
ON degree.reg_id = diploma.reg_id
JOIN further_studies
ON diploma.reg_id = further_studies.reg_id
JOIN iti
ON further_studies.reg_id = iti.reg_id
JOIN personal_info
ON iti.reg_id = personal_info.reg_id
JOIN postgraduation
ON personal_info.reg_id = postgraduation.reg_id
JOIN puc
ON postgraduation.reg_id = puc.reg_id
JOIN skills
ON puc.reg_id = skills.reg_id
JOIN sslc
ON skills.reg_id = sslc.reg_id
JOIN license
ON sslc.reg_id = license.reg_id
JOIN passport
ON license.reg_id = passport.reg_id
GROUP BY fullname
Please help me if I did any mistake
This is a bit long for a comment.
The first problem with your query is that you are using select * with group by fullname. You have zillions of columns in the select that are not in the group by. Unless you really, really, really know what you are doing (which I doubt), this is the wrong way to write a query.
Your performance problem is undoubtedly due to cartesian products and lack of indexes. You are joining across different dimensions -- such as skills and degrees. The result is a product of all the possibilities. For some people, the data size can grow and grow and grow.
And then, the question is: do you have indexes on the keys used in the joins? For performance, you generally want such indexes.
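The row multiplication described above is easy to reproduce on a tiny scale. The following is a minimal sketch in Python with the built-in sqlite3 module standing in for MySQL; the table names come from the question, but the columns skill and degree_name and the sample data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personal_info (reg_id INTEGER, fullname TEXT);
CREATE TABLE skills (reg_id INTEGER, skill TEXT);
CREATE TABLE degree (reg_id INTEGER, degree_name TEXT);
INSERT INTO personal_info VALUES (1, 'Asha');
INSERT INTO skills VALUES (1, 'SQL'), (1, 'PHP'), (1, 'Java');
INSERT INTO degree VALUES (1, 'BSc'), (1, 'MSc');
""")

# skills and degree are independent of each other, so joining both on reg_id
# pairs every skill row with every degree row for the same person.
joined = conn.execute("""
    SELECT p.fullname, s.skill, d.degree_name
    FROM personal_info p
    JOIN skills s ON s.reg_id = p.reg_id
    JOIN degree d ON d.reg_id = p.reg_id
""").fetchall()
print(len(joined))  # 3 skills x 2 degrees = 6 rows for one person
```

With eleven tables joined the same way, the per-person row count is the product of the per-table counts, which is how a small database can still produce a very large intermediate result.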
I think the problem is in the query. First make sure the GROUP BY fullname is correct, and try to give specific column names instead of *.

SELECT DISTINCT statement in MySQL is taking 10 minutes

I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
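To make the "DISTINCT on the IDs only" idea concrete, here is a small sketch in Python with the built-in sqlite3 module as a stand-in for MySQL. The tables are cut down to two and the data is invented; it only checks that the two query shapes return the same stop points and says nothing about timing on the real 250M-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stoppoints (atcoCode TEXT PRIMARY KEY, name TEXT);
CREATE TABLE stop_times (trip_id INTEGER, fk_atco_code TEXT);
INSERT INTO stoppoints VALUES ('A1', 'High Street'), ('B2', 'Station'), ('C3', 'Depot');
INSERT INTO stop_times VALUES (1, 'A1'), (2, 'A1'), (3, 'A1'), (1, 'B2'), (2, 'B2');
""")

# DISTINCT applied to the full joined rows.
wide = conn.execute("""
    SELECT DISTINCT sp.atcoCode, sp.name
    FROM stoppoints sp
    JOIN stop_times st ON st.fk_atco_code = sp.atcoCode
""").fetchall()

# DISTINCT applied to the key alone, then joined back for the other columns.
narrow = conn.execute("""
    SELECT sp.atcoCode, sp.name
    FROM stoppoints sp
    JOIN (SELECT DISTINCT fk_atco_code FROM stop_times) ids
      ON ids.fk_atco_code = sp.atcoCode
""").fetchall()

print(sorted(wide) == sorted(narrow))  # both forms return the same stop points
```

The second form only works as a drop-in replacement when atcoCode is unique in stoppoints, which is the assumption the answer asks about.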
Having 250M records, I would shard the gtfsstop_times table on one column. Then each sharded table can be joined in a separate query that can run parallel in separate threads, you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this;
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There are other valuable answers to your question, and mine is an addition to them. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their tables.
If you can validate and make sure that agency ids in the WHERE clause are valid, you can eliminate joining `vehicledata.gtfsagencys` in the JOINS as you are not fetching any records from the table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);

Help me optimize this query

I have this query for an application that I am designing. There is a table of references, an authors table and a reference_authors table. There is a subquery to return all authors for a given reference, which I then display formatted in PHP. The subquery and the main query, run individually, are both nice and speedy. However, as soon as the subquery is put into the main query, the whole thing takes over 120s to run. I would appreciate some fresh eyes on this one.
Thanks.
SELECT
rf.reference_id,
rf.reference_type_id,
rf.article_title,
rf.publication,
rf.annotation,
rf.publication_year,
(SELECT GROUP_CONCAT(a.author_name)
FROM authors_final AS a
INNER JOIN reference_authors AS ra2 ON ra2.author_id = a.author_id
WHERE ra2.reference_id = rf.reference_id
GROUP BY ra2.reference_id) AS authors
FROM
references_final AS rf
INNER JOIN reference_authors AS ra ON rf.reference_id = ra.reference_id
LEFT JOIN reference_institutes AS ri ON rf.reference_id = ri.reference_id;
Here is the fixed query. Thanks guys for the recommendations.
SELECT
rf.reference_id,
rf.reference_type_id,
rf.article_title,
rf.publication,
rf.annotation,
rf.publication_year,
GROUP_CONCAT(a.author_name) AS authors
FROM
references_final as rf
INNER JOIN (reference_authors AS ra INNER JOIN authors_final AS a ON ra.author_id = a.author_id)
ON rf.reference_id = ra.reference_id
LEFT JOIN reference_institutes AS ri ON rf.reference_id = ri.reference_id
GROUP BY rf.reference_id
Although not every subquery can be rewritten as an inner join, I think yours can.
From 120 seconds to 78 milliseconds is not a bad improvement--about three orders of magnitude. Take the rest of the day off.
When you come back tomorrow, start looking for other subqueries in your source code.
You say the subquery is nice and speedy in isolation, but it's now obviously running for every single row: 100 rows = 100 subqueries.
Assuming you have indexes on all your foreign keys, that's as good as it gets as a subquery.
One option is to left join authors and create a Cartesian product. You'll have a lot more rows returned and will need some code to get to the same end result, but it will put less strain on the db and will run quicker.
If you've got paging on and are returning, say, 10 rows, issuing 10 individual calls to get the authors in isolation would also be pretty quick.
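For anyone who wants to compare the two query shapes side by side, here is a minimal sketch in Python with the built-in sqlite3 module as a stand-in for MySQL. The table names follow the question, the sample rows are invented, and the reference_institutes join is left out to keep it short. It only shows that the correlated subquery and the join with GROUP BY give the same authors per reference; the timing difference has to be measured on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE references_final (reference_id INTEGER PRIMARY KEY, article_title TEXT);
CREATE TABLE authors_final (author_id INTEGER PRIMARY KEY, author_name TEXT);
CREATE TABLE reference_authors (reference_id INTEGER, author_id INTEGER);
INSERT INTO references_final VALUES (1, 'Paper one'), (2, 'Paper two');
INSERT INTO authors_final VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
INSERT INTO reference_authors VALUES (1, 1), (1, 2), (2, 3);
""")

# Correlated form: the inner SELECT depends on the current outer row.
correlated = conn.execute("""
    SELECT rf.reference_id,
           (SELECT GROUP_CONCAT(a.author_name)
            FROM authors_final a
            JOIN reference_authors ra ON ra.author_id = a.author_id
            WHERE ra.reference_id = rf.reference_id) AS authors
    FROM references_final rf
""").fetchall()

# Join form: the joined rows are grouped once per reference.
joined = conn.execute("""
    SELECT rf.reference_id, GROUP_CONCAT(a.author_name) AS authors
    FROM references_final rf
    JOIN reference_authors ra ON ra.reference_id = rf.reference_id
    JOIN authors_final a ON a.author_id = ra.author_id
    GROUP BY rf.reference_id
""").fetchall()

def normalize(rows):
    # GROUP_CONCAT order is not guaranteed, so compare sets of author names.
    return {ref_id: set(names.split(",")) for ref_id, names in rows}

print(normalize(correlated) == normalize(joined))
```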

In what order are MySQL JOINs evaluated?

I have the following query:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123;
I have the following questions:
1. Is the USING syntax synonymous with ON syntax?
2. Are these joins evaluated left to right? In other words, does this query say: x = companies JOIN users; y = x JOIN jobs; z = y JOIN useraccounts;
3. If the answer to question 2 is yes, is it safe to assume that the companies table has companyid, userid and jobid columns?
4. I don't understand how the WHERE clause can be used to pick rows on the companies table when it is referring to the alias "j"
Any help would be appreciated!
1. USING (fieldname) is a shorthand way of saying ON table1.fieldname = table2.fieldname.
2. SQL doesn't define the 'order' in which JOINS are done because it is not the nature of the language. Obviously an order has to be specified in the statement, but an INNER JOIN can be considered commutative: you can list them in any order and you will get the same results.
That said, when constructing a SELECT ... JOIN, particularly one that includes LEFT JOINs, I've found it makes sense to regard the third JOIN as joining the new table to the results of the first JOIN, the fourth JOIN as joining the results of the second JOIN, and so on.
More rarely, the specified order can influence the behaviour of the query optimizer, due to the way it influences the heuristics.
3. No. The way the query is assembled, it requires that companies and users both have a companyid, jobs has a userid and a jobid, and useraccounts has a userid. However, only one of companies or users needs a userid for the JOIN to work.
4. The WHERE clause is filtering the whole result -- i.e. all JOINed columns -- using a column provided by the jobs table.
I can't answer the bit about the USING syntax. That's weird. I've never seen it before, having always used an ON clause instead.
But what I can tell you is that the order of JOIN operations is determined dynamically by the query optimizer when it constructs its query plan, based on a system of optimization heuristics, some of which are:
Is the JOIN performed on a primary key field? If so, this gets high priority in the query plan.
Is the JOIN performed on a foreign key field? This also gets high priority.
Does an index exist on the joined field? If so, bump the priority.
Is a JOIN operation performed on a field in WHERE clause? Can the WHERE clause expression be evaluated by examining the index (rather than by performing a table scan)? This is a major optimization opportunity, so it gets a major priority bump.
What is the cardinality of the joined column? Columns with high cardinality give the optimizer more opportunities to discriminate against false matches (those that don't satisfy the WHERE clause or the ON clause), so high-cardinality joins are usually processed before low-cardinality joins.
How many actual rows are in the joined table? Joining against a table with only 100 values is going to create less of a data explosion than joining against a table with ten million rows.
Anyhow... the point is... there are a LOT of variables that go into the query execution plan. If you want to see how MySQL optimizes its queries, use the EXPLAIN syntax.
And here's a good article to read:
http://www.informit.com/articles/article.aspx?p=377652
ON EDIT:
To answer your 4th question: You aren't querying the "companies" table. You're querying the joined cross-product of ALL four tables in your FROM and USING clauses.
The "j.jobid" alias is just the fully-qualified name of one of the columns in that joined collection of tables.
In MySQL, it's often interesting to ask the query optimizer what it plans to do, with:
EXPLAIN SELECT [...]
See "7.2.1 Optimizing Queries with EXPLAIN"
Here is a more detailed answer on JOIN precedence. In your case, the JOINs are all commutative. Let's try one where they aren't.
Build schema:
CREATE TABLE users (
    name text
);
CREATE TABLE orders (
    order_id text,
    user_name text
);
CREATE TABLE shipments (
    order_id text,
    fulfiller text
);
Add data:
INSERT INTO users VALUES ('Bob'), ('Mary');
INSERT INTO orders VALUES ('order1', 'Bob');
INSERT INTO shipments VALUES ('order1', 'Fulfilling Mary');
Run query:
SELECT *
FROM users
LEFT OUTER JOIN orders
    ON orders.user_name = users.name
JOIN shipments
    ON shipments.order_id = orders.order_id
Result:
Only the Bob row is returned
Analysis:
In this query the LEFT OUTER JOIN was evaluated first and the JOIN was evaluated on the composite result of the LEFT OUTER JOIN.
Second query:
SELECT *
FROM users
LEFT OUTER JOIN (
    orders
    JOIN shipments
        ON shipments.order_id = orders.order_id)
    ON orders.user_name = users.name
Result:
One row for Bob (with the fulfillment data) and one row for Mary with NULLs for fulfillment data.
Analysis:
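The parentheses changed the evaluation order: the inner JOIN between orders and shipments is now evaluated first, and users is LEFT OUTER JOINed to that result, so Mary survives even though she has no order.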
Further MySQL documentation is at https://dev.mysql.com/doc/refman/5.5/en/nested-join-optimization.html
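The two results above can be reproduced with a small script. This is only a sketch: it uses Python's built-in sqlite3 module rather than MySQL, selects two columns instead of `*` to keep the output short, and writes the parenthesized join as an aliased derived table (the alias `os` is made up for the example), which expresses the same grouping.

```python
import sqlite3

def run_join_precedence_demo():
    # In-memory SQLite database standing in for MySQL (an assumption of this sketch).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (name text);
        CREATE TABLE orders (order_id text, user_name text);
        CREATE TABLE shipments (order_id text, fulfiller text);
        INSERT INTO users VALUES ('Bob'), ('Mary');
        INSERT INTO orders VALUES ('order1', 'Bob');
        INSERT INTO shipments VALUES ('order1', 'Fulfilling Mary');
    """)

    # First query: the LEFT OUTER JOIN runs first, then the inner JOIN
    # discards Mary's row because her orders.order_id is NULL.
    flat = conn.execute("""
        SELECT users.name, shipments.fulfiller
        FROM users
        LEFT OUTER JOIN orders ON orders.user_name = users.name
        JOIN shipments ON shipments.order_id = orders.order_id
        ORDER BY users.name
    """).fetchall()

    # Second query: orders and shipments are joined first (here as a derived
    # table), and users is LEFT OUTER JOINed to that result.
    grouped = conn.execute("""
        SELECT users.name, os.fulfiller
        FROM users
        LEFT OUTER JOIN (
            SELECT orders.user_name, shipments.fulfiller
            FROM orders
            JOIN shipments ON shipments.order_id = orders.order_id
        ) AS os ON os.user_name = users.name
        ORDER BY users.name
    """).fetchall()

    conn.close()
    return flat, grouped

flat_rows, grouped_rows = run_join_precedence_demo()
print(flat_rows)
print(grouped_rows)
```

The first list contains only Bob's row; the second contains Bob's row plus a row for Mary with None in place of the fulfiller.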
See http://dev.mysql.com/doc/refman/5.0/en/join.html
and start reading at this section:
Join Processing Changes in MySQL 5.0.12
Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The goal was to align the syntax and semantics of MySQL with respect to NATURAL JOIN and JOIN ... USING according to SQL:2003. However, these changes in join processing can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions must be rewritten to comply with the standard.
These changes have five main aspects:
The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
Resolution of column names in NATURAL or USING joins.
Transformation of NATURAL or USING joins into JOIN ... ON.
Resolution of column names in the ON condition of a JOIN ... ON.
I'm not sure about the ON vs USING part (though this website says they are the same).
As for the ordering question, it's entirely implementation (and probably query) specific. MySQL most likely picks an order when compiling the request. If you do want to enforce a particular order you would have to 'nest' your queries:
SELECT c.*
FROM companies AS c
JOIN (SELECT u.companyid
      FROM users AS u
      JOIN (SELECT j.userid
            FROM jobs AS j
            JOIN useraccounts AS us USING(userid)
            WHERE j.jobid = 123) AS ju USING(userid)
     ) AS cu USING(companyid)
As for part 4: the WHERE clause limits which rows from the jobs table are eligible to be JOINed. Rows that would join on a matching userid but don't have the correct jobid are omitted.
1) USING is not exactly the same as ON; it is shorthand for the case where both tables have a column with the same name that you are joining on. See: http://www.java2s.com/Tutorial/MySQL/0100__Table-Join/ThekeywordUSINGcanbeusedasareplacementfortheONkeywordduringthetableJoins.htm
It is more difficult to read, in my opinion, so I'd spell out the joins with ON.
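One visible difference between the two forms is what SELECT * returns: with USING the shared column appears once, with ON it appears once per table. A rough sketch of that, run against SQLite through Python's sqlite3 module rather than MySQL; the two-column tables and the `cname` column are invented for the example:

```python
import sqlite3

def column_names(conn, sql):
    # cursor.description holds one entry per output column; item 0 is its name.
    return [d[0] for d in conn.execute(sql).description]

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (companyid integer, cname text);
    CREATE TABLE users (userid integer, companyid integer);
    INSERT INTO companies VALUES (1, 'Acme');
    INSERT INTO users VALUES (10, 1);
""")

# USING: the join column is merged into a single output column.
using_cols = column_names(
    conn, "SELECT * FROM companies JOIN users USING (companyid)")

# ON: both tables' companyid columns are kept.
on_cols = column_names(
    conn,
    "SELECT * FROM companies AS c JOIN users AS u ON u.companyid = c.companyid")

conn.close()
print(using_cols)
print(on_cols)
```

The row data is the same either way; only the shape of the SELECT * column list differs.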
3) It is not clear from this query, but I would guess it does not.
2) Assuming you are joining through the other tables (not all directly on companies), the order in this query does matter... see comparisons below:
Original:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123
What I think it is likely suggesting:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = u.userid
JOIN useraccounts AS us on us.userid = u.userid
WHERE j.jobid = 123
You could switch the lines joining jobs and useraccounts here, since both join to users.
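A quick way to convince yourself that the switch is harmless is to run both orderings against the same data. The sketch below does that with Python's sqlite3 module instead of MySQL; the column lists and sample rows are hypothetical, chosen only so that the query has something to match.

```python
import sqlite3

ORIGINAL_ORDER = """
    SELECT c.*
    FROM companies AS c
    JOIN users AS u ON u.companyid = c.companyid
    JOIN jobs AS j ON j.userid = u.userid
    JOIN useraccounts AS us ON us.userid = u.userid
    WHERE j.jobid = 123
"""

# Same query with the jobs and useraccounts lines swapped.
SWITCHED_ORDER = """
    SELECT c.*
    FROM companies AS c
    JOIN users AS u ON u.companyid = c.companyid
    JOIN useraccounts AS us ON us.userid = u.userid
    JOIN jobs AS j ON j.userid = u.userid
    WHERE j.jobid = 123
"""

# In-memory SQLite database with made-up tables and rows (assumptions of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (companyid integer, cname text);
    CREATE TABLE users (userid integer, companyid integer);
    CREATE TABLE jobs (jobid integer, userid integer);
    CREATE TABLE useraccounts (userid integer, account text);
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO users VALUES (10, 1), (20, 2);
    INSERT INTO jobs VALUES (123, 10), (456, 20);
    INSERT INTO useraccounts VALUES (10, 'a'), (20, 'b');
""")

original_rows = conn.execute(ORIGINAL_ORDER).fetchall()
switched_rows = conn.execute(SWITCHED_ORDER).fetchall()
conn.close()

print(original_rows)
print(switched_rows)
```

Both statements return the single company that owns job 123. This only holds because every join here is an inner join whose condition refers to a table already listed above it.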
What it would look like if everything joined on company:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = c.userid
JOIN useraccounts AS us on us.userid = c.userid
WHERE j.jobid = 123
This doesn't really make logical sense... unless each user has their own company.
4.) The magic of SQL is that you can show only certain columns, but all of them are there for sorting and filtering...
If you returned
SELECT c.*, j.jobid....
you could clearly see what it was filtering on, but the database server doesn't care whether you output a column or not when it filters on it.