MySQL: Perform join on all rows of a table

I have a view where I combine some normalized tables. Based on a "master" table, I join connected tables (e.g. JOIN child ON master.child_fk = child.pk). This is pretty straightforward. Now, I'd like to extend this query to perform a join on ALL child rows in some special cases, for example if master.child_fk equals -1.
I managed to get a working query by creating a view where I duplicate all rows and set the pk to -1 in the duplicates, but this is incredibly slow (I have quite a lot of data). The same result could be produced by iterating over all the child.pks and performing a separate join for each, but I can't imagine that being faster.
What would be the best way to go about this using MySQL? Please ask questions if something is not clear.
edit: I can add that my attempt seems to have been slow because of poor index utilization. See the attached EXPLAIN output here https://i.imgur.com/8zfT0HM.png

Replace your join condition with JOIN child ON CASE WHEN master.child_fk != -1 THEN master.child_fk = child.pk ELSE 1 END
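A minimal sketch of how that join condition behaves, using Python's sqlite3 module as a stand-in for MySQL. The table contents and the master.id column are invented for illustration. It also compares the CASE form with a plain OR condition, which is an alternative spelling and not part of the answer above. The two agree only when child_fk is never NULL: with a NULL child_fk the CASE form falls through to ELSE 1 and joins every child row, while the OR form joins none.

```python
import sqlite3

# Hypothetical miniature schema; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE child  (pk INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE master (id INTEGER PRIMARY KEY, child_fk INTEGER);
    INSERT INTO child  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO master VALUES (10, 2), (11, -1);
""")

# Join condition from the answer: -1 means "match every child row".
CASE_JOIN = """
    SELECT master.id, child.pk
    FROM master
    JOIN child
      ON CASE WHEN master.child_fk != -1
              THEN master.child_fk = child.pk
              ELSE 1
         END
    ORDER BY master.id, child.pk
"""

# Alternative spelling (an assumption, not from the answer).
OR_JOIN = """
    SELECT master.id, child.pk
    FROM master
    JOIN child
      ON master.child_fk = child.pk OR master.child_fk = -1
    ORDER BY master.id, child.pk
"""

case_rows = conn.execute(CASE_JOIN).fetchall()
or_rows = conn.execute(OR_JOIN).fetchall()
print(case_rows)
```

Master row 10 pairs with child 2 only, while master row 11 (child_fk = -1) pairs with all three child rows. Whether MySQL can use an index on child.pk for either form is something to check with EXPLAIN on the real tables.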

Related

Optimizing INNER JOIN across multiple tables

I have trawled many of the similar responses on this site and have improved my code at several stages along the way. Unfortunately, this 3-row query still won't run.
I have one table with 100k+ rows and about 30 columns, which I can filter down to 3 rows (in this example), and then perform INNER JOINs across 21 small lookup tables.
In my first attempt, I was lazy and used implicit joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `lookup_table` x 21
WHERE `master_table`.`indexed_col` = "value"
AND `lookup_table`.`id` = `lookup_col` x 21
The query looked to be timing out:
#2013 - Lost connection to MySQL server during query
Following this, I tried being explicit about the joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `master_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `master_table`.`lookup_col` x 21
WHERE `master_table`.`indexed_col` = "value"
Still got the same result. I then realised that the query was probably trying to perform the joins first, then filter down via the WHERE clause. So after a bit more research, I learned how I could apply a subquery to perform the filter first and then perform the joins on the newly created table. This is where I got to, and it still returns the same error. Is there any way I can improve this query further?
SELECT `temp_table`.*, `lookup_table`.`data_point` x 21
FROM (SELECT * FROM `master_table` WHERE `indexed_col` = "value") as `temp_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `temp_table`.`lookup_col` x 21
Is this the best way to write up this kind of query? I tested the subquery to ensure it only returns a small table and can confirm that it returns only three rows.

First, at its simplest, you are looking for
select
mt.*
from
Master_Table mt
where
mt.indexed_col = 'value'
That is probably instantaneous provided you have an index on your master table on the given indexed_col in the first position (in case you had a compound index of many fields)…
Now, if I am understanding you correctly about your different lookup columns (21 in total), you have simplified them in this post to avoid redundancy, but you are actually doing something to the effect of
select
mt.*,
lt1.lookupDescription1,
lt2.lookupDescription2,
...
lt21.lookupDescription21
from
Master_Table mt
JOIN Lookup_Table1 lt1
on mt.lookup_col1 = lt1.pk_col1
JOIN Lookup_Table2 lt2
on mt.lookup_col2 = lt2.pk_col2
...
JOIN Lookup_Table21 lt21
on mt.lookup_col21 = lt21.pk_col21
where
mt.indexed_col = 'value'
I had a project well over a decade ago dealing with a similar situation... the master table had 21+ million records and had to join to 30+ lookup tables. The system crawled, and the query died after running for more than 24 hrs.
This too was on a MySQL server and the fix was a single MySQL keyword...
Select STRAIGHT_JOIN mt.*, ...
With your master table in the primary position, and the WHERE clause and its criteria directly on the master table, you are good. You know the relationships of the tables. STRAIGHT_JOIN tells MySQL: do the query in the exact order I presented it to you. Don't try to think for me and optimize based on a subsidiary table that may have a smaller record count, on the theory that it will make the query faster... it won't.
Try the STRAIGHT_JOIN keyword. It took the query I was working on and finished it in about 1.5 hrs... it was returning all 21 million rows with all corresponding lookup key descriptions for final output, hence it still needed far longer than a query for just 3 records would.

First, don't use a subquery. Write the query as:
SELECT mt.*, lt.`data_point`
FROM `master_table` mt INNER JOIN
`lookup_table` lt
ON lt.`id` = mt.`lookup_col`
WHERE mt.`indexed_col` = 'value';
The indexes that you want are master_table(indexed_col, lookup_col) and lookup_table(id, data_point).
If you are still having performance problems, then there are multiple possibilities. High among them is that the result set is simply too big to return in a reasonable amount of time. To see if that is the case, you can use select count(*) to count the number of returned rows.
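A small sketch of the "count first" check, again run against SQLite through Python's sqlite3 module, not against MySQL. The index names, the sample rows, and the count_joined_rows helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup_table (id INTEGER, data_point TEXT);
    CREATE TABLE master_table (indexed_col TEXT, lookup_col INTEGER);
    -- Compound indexes along the lines suggested in the answer
    -- (index names are made up).
    CREATE INDEX idx_master ON master_table (indexed_col, lookup_col);
    CREATE INDEX idx_lookup ON lookup_table (id, data_point);
    INSERT INTO lookup_table VALUES (1, 'red'), (2, 'green');
    INSERT INTO master_table VALUES ('value', 1), ('value', 2), ('other', 1);
""")

def count_joined_rows(conn, wanted):
    """Count the rows the join would return, without fetching them."""
    sql = """
        SELECT COUNT(*)
        FROM master_table mt
        INNER JOIN lookup_table lt ON lt.id = mt.lookup_col
        WHERE mt.indexed_col = ?
    """
    return conn.execute(sql, (wanted,)).fetchone()[0]

print(count_joined_rows(conn, "value"))
```

If the count is small and the full query is still slow, the size of the result set is probably not the problem and the join plan is the next thing to inspect.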

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL query. I'm developing a PHP website, and to avoid making too many queries, I prefer to make one big query that looks like this:
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member that have no bids on them. In the result, I need all the information required to display them.
Here are the approximate row counts for each table:
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in the dev environment) by deleting old data, but the query still needs 10~15 seconds to execute (which is way too long for a website).
Thanks for your help.

Let's take a look.
1. The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
2. You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL usually uses only one index per table in a query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_joueur, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third columns in ON conditions.
3. I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_joueur and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
4. You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than an ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to any rows in the mercato_encheres table. The mismatching rows get NULL values for the columns of the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.
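To check that the two formulations select the same rows, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The tables are cut down to the columns involved; the ID_enchere column, the index name, and the sample rows are assumptions made for the example. It says nothing about which form MySQL executes faster, which depends on the version and the plan shown by EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mercato (ID_mercato INTEGER PRIMARY KEY, status TEXT);
    -- ID_enchere is a hypothetical key column for the bids table.
    CREATE TABLE mercato_encheres (ID_enchere INTEGER PRIMARY KEY,
                                   ID_mercato INTEGER);
    CREATE INDEX idx_encheres_mercato ON mercato_encheres (ID_mercato);
    INSERT INTO mercato VALUES
        (1, 'open'), (2, 'open'), (3, 'cancelled'), (4, 'open');
    INSERT INTO mercato_encheres VALUES (100, 2), (101, 2), (102, 3);
""")

# Original form: correlated NOT EXISTS subquery.
NOT_EXISTS = """
    SELECT m.ID_mercato
    FROM mercato m
    WHERE NOT EXISTS (SELECT 1 FROM mercato_encheres e
                      WHERE e.ID_mercato = m.ID_mercato)
      AND m.status <> 'cancelled'
    ORDER BY m.ID_mercato
"""

# Refactored form: LEFT JOIN ... IS NULL keeps only unmatched rows.
LEFT_JOIN_IS_NULL = """
    SELECT m.ID_mercato
    FROM mercato m
    LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
    WHERE mench.ID_mercato IS NULL
      AND m.status <> 'cancelled'
    ORDER BY m.ID_mercato
"""

not_exists_ids = [r[0] for r in conn.execute(NOT_EXISTS)]
left_join_ids = [r[0] for r in conn.execute(LEFT_JOIN_IS_NULL)]
print(not_exists_ids, left_join_ids)
```

Sale 2 is excluded because it has bids, and sale 3 because it is cancelled, so both queries return sales 1 and 4.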

optimizing particular query mysql

So I've been searching for a solution and reading books, and haven't been able to figure it out. The question is rather simple: I have 2 tables. On one table I have 2 fields:
table_1:"chromosome" and "position", both of them being integers.
table_2:"chromosome" "start" and "end", all being integers as well.
I want a query that gives me back all rows from table_1 that are between the start and end of table_2. The query looks like this:
SELECT
table_1 . *
FROM
table_1,
table_2
WHERE
table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_2.end;
So this query works fine, but my tables have many rows (7092713 and 215909 respectively). I indexed (chromosome, position) on table_1 and (chromosome, start, end) on table_2. The weird part is that if I run the query one range at a time (Perl DBI, one statement for every row of table_2), it runs a lot faster. Not sure where I am screwing up.
Any help would be appreciated.
Jorge Kageyama

For the sake of clarity, let's start by recasting your query using the standard JOIN syntax. The query is equivalent but easier to read.
SELECT table_1 . *
FROM table_1
JOIN table_2 ON ( table_1.chromosome = table_2.chromosome
AND table_1.position > table_2.start
AND table_1.position < table_2.end)
Second, it's smart when searching large tables (or any tables for that matter) to avoid * in your SELECT clauses. Using * denies useful data to the optimizer about what you do, or don't, need in your result set. So let us say
SELECT table_1.chromosome, table_1.position
for SELECT.
So, it becomes clear that your result set, and your join, need chromosome and position, and nothing else, from your larger table. Try creating a compound BTREE index on that table, as follows.
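CREATE INDEX idx_chr_pos ON table_1 (chromosome, position) USING BTREE
Similarly, try creating an index on table_2 as follows. (CREATE INDEX requires an index name; the names here are arbitrary.)
CREATE INDEX idx_chr_start_end ON table_2 (chromosome, start, end) USING BTREE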
These are called covering indexes. They contain enough columns that the query can be satisfied from the index without having to bounce back to the original table.
BTREE indexes (the default by the way) are inherently ordered. Appropriate records in table_1 can be found by range scans on the index starting with (chromosome,start) and ending with (chromosome,end).
Third, it's possible you're getting a massive combinatorial explosion of rows from table_1 in your result set. You'll get a row for every combination of rows in the two tables that matches your ON() clause. It's hard to know whether that's the case without knowing a lot about your data.
You could try to reduce that combinatorial explosion using
SELECT DISTINCT table_1.chromosome, table_1.position
Give this a try. If you're still not getting anywhere, maybe another question with complete table definitions and the results of EXPLAIN will be helpful.
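The row multiplication described above can be seen on a toy data set. This is a sketch using Python's sqlite3 module, not MySQL: SQLite has no USING BTREE clause, the index names are made up, and start and end are quoted only to be safe with SQLite's keyword handling. The sample rows are invented so that two overlapping ranges contain the same position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (chromosome INTEGER, position INTEGER);
    CREATE TABLE table_2 (chromosome INTEGER, "start" INTEGER, "end" INTEGER);
    -- Index names are made up; SQLite has no USING BTREE clause.
    CREATE INDEX idx_t1_chr_pos ON table_1 (chromosome, position);
    CREATE INDEX idx_t2_chr_range ON table_2 (chromosome, "start", "end");
    INSERT INTO table_1 VALUES (1, 5), (1, 50), (2, 5);
    -- Two overlapping ranges on chromosome 1 both contain position 5.
    INSERT INTO table_2 VALUES (1, 0, 10), (1, 3, 8), (2, 10, 20);
""")

RANGE_JOIN = """
    SELECT {distinct} table_1.chromosome, table_1.position
    FROM table_1
    JOIN table_2
      ON table_1.chromosome = table_2.chromosome
     AND table_1.position > table_2."start"
     AND table_1.position < table_2."end"
    ORDER BY table_1.chromosome, table_1.position
"""

# Without DISTINCT: one output row per matching (position, range) pair.
all_rows = conn.execute(RANGE_JOIN.format(distinct="")).fetchall()
# With DISTINCT: each matching position appears once.
unique_rows = conn.execute(RANGE_JOIN.format(distinct="DISTINCT")).fetchall()
print(all_rows, unique_rows)
```

Position 5 on chromosome 1 falls inside two ranges, so the plain join returns it twice and the DISTINCT version once. On the real tables, comparing the two row counts gives a rough measure of how much overlap there is.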

Interesting question. Without knowing more about the quantities contained in "position," I would still approach it generally in this way:
Select by position from table_1 (with its 7 million rows) so that the intermediate result is a smaller bin of data. Let's say, for instance, that "position" is a set of discrete integers from 2 to 9. Select from table_1 where position is equal to 2, then select from table_2 where "start" is less than 2 and "end" is greater than 2. Repeat this selection 8 times, once per position value, appending the results to a new table_3.
I am assuming here that table_2 is unique on chromosome, and table_1 is not. Therefore, you end up with chromosomes that could have multiple positions within the same range (a chromosome has one range, but can appear anywhere within that range). You also can't tell in advance how large the resulting join is going to be, but it could be quite large, as each of the 7 million rows in table_1 could fall within every range in table_2.
Iterating would allow you to "grow" your results while observing the quality at each point experimentally before committing to the entire loop.
Here is an idea of the query I have in mind (untested):
SELECT t1.chromosome, t1.position, t2.start, t2.end
FROM
(SELECT table_1.chromosome, table_1.position
from table_1 where table_1.position = 2) AS t1 -- MySQL requires an alias on every derived table
JOIN
(SELECT table_2.chromosome, table_2.start, table_2.end
from table_2 where table_2.start < 2 AND table_2.end > 2) AS t2
ON
t1.chromosome = t2.chromosome
Good luck, and I hope you find your answer!

Determine if joined table has 1 or more than 1 matching rows. Is there a better way than GROUP BY and COUNT?

I join table A to table B and need to know if table B has 1 matching row or more than one.
Of course, I can do it with GROUP BY and COUNT, but it's an overkill, because it has to count all the matches and I don't need this info.
Is there a simple way to get the info I need (only one matching row or more) which short circuits the evaluation and stops when it knows the answer without scanning and counting all the remaining matches?
Or should I not care about this, because it's not a big performance hit, and simply go with COUNT?

It really depends on the size of the DB and your exact requirements. Generally a COUNT()/GROUP BY/HAVING combination is a pretty efficient query, given the right indexes. You could do it in a more complicated way, for example with an AFTER UPDATE trigger that keeps a count table up to date.
Are you seeing the count(*)/group/having combination giving you performance issues?

If you just need to know whether a given join produces one matching row or more than one:
-- Without any sample SQL code, here's a return sample
SELECT B.SOMEJOINAPPLICABLECOLUMN
FROM A
LEFT OUTER JOIN B
ON A.SOMEJOINAPPLICABLECOLUMN = B.SOMEJOINAPPLICABLECOLUMN
WHERE
B.SOMEJOINAPPLICABLECOLUMN IS NOT NULL
LIMIT 2;
Naturally:
2 returned rows = more than one match
1 returned row = one match
0 returned rows = no matches
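A minimal sketch of the LIMIT 2 idea for a single row of A, using Python's sqlite3 module in place of a MySQL connection. The tables, the a_id column, the index name, and the match_kind helper are hypothetical. Note that LIMIT applies to the whole result set, so the query is restricted to one A.id at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY);
    CREATE TABLE B (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE INDEX idx_b_a_id ON B (a_id);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B (a_id) VALUES (1), (2), (2), (2);
""")

def match_kind(conn, a_id):
    """Classify the matches in B for one row of A as none, one, or many.

    LIMIT 2 lets the database stop after the second match instead of
    counting every matching row.
    """
    rows = conn.execute(
        """
        SELECT B.a_id
        FROM A
        JOIN B ON A.id = B.a_id
        WHERE A.id = ?
        LIMIT 2
        """,
        (a_id,),
    ).fetchall()
    return ("none", "one", "many")[len(rows)]

print([match_kind(conn, a_id) for a_id in (1, 2, 3)])
```

Row 2 of A has three matches in B, but at most two rows are ever fetched for it. For classifying many rows of A at once, the GROUP BY and COUNT approach from the other answer is probably simpler than calling this once per row.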

Putting together a SQL Stored Proc

So I have a couple SQL commands that I basically want to make a proc, but while doing this, I'd like to optimize them a little bit more.
The first part of it is this:
select tr_reference_nbr
from cfo_daily_trans_hist
inner join cfo_fas157_valuation on fv_dh_daily_trans_hist_id = dh_daily_trans_hist_id
inner join cfo_tran_quote on tq_tran_quote_id = dh_tq_tran_quote_id
inner join cfo_transaction on tq_tr_transaction_id = tr_transaction_id
inner join cfo_fas157_project_valuation ON fpv_fas157_project_valuation_id = fv_fpv_fas157_project_valuation_id AND fpv_status_bit = 1
group by tr_reference_nbr, fv_dh_daily_trans_hist_id
having count(*)>1
This query returns to me which tr_reference_nbr's exist that have duplicate data in our system, which needs to be removed. After this is run, I run this other query, copying and pasting in the tr_reference_nbr one at a time that the above query gave me:
select
tr_reference_nbr , dh_daily_trans_hist_id ,cfo_fas157_project_valuation.*,
cfo_daily_trans_hist.* ,
cfo_fas157_valuation.*
from cfo_daily_trans_hist
inner join cfo_fas157_valuation on fv_dh_daily_trans_hist_id = dh_daily_trans_hist_id
inner join cfo_tran_quote on tq_tran_quote_id = dh_tq_tran_quote_id
inner join cfo_transaction on tq_tr_transaction_id = tr_transaction_id
inner join cfo_fas157_project_valuation ON fpv_fas157_project_valuation_id = fv_fpv_fas157_project_valuation_id
where
tr_reference_nbr in
(
[PASTEDREFERENCENUMBER]
)
and fpv_status_bit = 1
order by dh_val_time_stamp desc
Now this query gives me a bunch of records for that specific tr_reference_nbr. I then have to look through this data and find the rows that have a matching (duplicate) dh_daily_trans_hist_id. Once this is found, I look and make sure that the following columns also match for that row so I know they are true duplicates: fpv_unadjusted_sponsor_charge, fpv_adjusted_sponsor_charge, fpv_unadjusted_counterparty_charge, and fpv_adjusted_counterparty_charge.
If THOSE all match, I then look at yet another column, fv_create_dt, and make sure that there is less than a minute's difference between the two timestamps there. If there is, I run yet another query on the row that was stored EARLIER, which looks like this:
begin tran
update cfo_fas157_valuation set fpv_status_bit = 0 where fpv_fas157_project_valuation_id = [IDRECIEVEDFROMTHEOTHERTABLE]
commit
As you can see, this is still a very manual process even though we do have a few queries written, but I'm trying to find a solution to where we can just run one query, and it would basically do EVERYTHING except for the final query. So basically something that would provide to us a few fpv_fas157_project_valuation_id's that need to be updated.
From looking at these queries, do any of you guys see an easy way to combine all this? I've been working on it all day and can't seem to get something to run. I feel like I keep screwing up the joins and stuff.
Thanks!

You can combine these queries in multiple ways:
use temporary tables to store results of queries - suitable for stored procedure
use table variables to store results of queries - suitable for stored procedure
use Common Table Expressions (CTEs) to store results of queries - suitable for single query
Once you have them in separate tables/variables/CTEs, you can easily join them.
Then you have to do one more thing: find the difference in datetime between two consecutive rows. There is a trick to do this:
use ROW_NUMBER() to add a column with number of row partitioned by grouping fields (tr_reference_nbr, ... ) ordered by fv_create_dt
do a self join on A.ROW_NUMBER = B.ROW_NUMBER + 1
check the difference between A.fv_create_dt and B.fv_create_dt to filter the rows with difference less than a minute
Just test your self-join thoroughly to make sure you filter only the rows you need to filter.
If you still have problems with this, don't hesitate to leave a comment.
Interesting note: SQL Server Denali (released as SQL Server 2012) adds the T-SQL functions LEAD and LAG, which access the subsequent and previous rows without a self-join.
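The ROW_NUMBER plus self-join trick can be sketched on a toy table. This uses Python's sqlite3 module, not SQL Server, and it needs SQLite 3.25 or later for window functions. The valuation table, its columns, and the epoch-second timestamps are invented stand-ins for the joined cfo_ tables; the real query would partition by tr_reference_nbr and dh_daily_trans_hist_id and compare fv_create_dt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened table; created_at is seconds since the epoch.
    CREATE TABLE valuation (id INTEGER PRIMARY KEY, ref TEXT,
                            created_at INTEGER);
    INSERT INTO valuation VALUES
        (1, 'T1', 1000), (2, 'T1', 1030), (3, 'T1', 5000),
        (4, 'T2', 2000), (5, 'T2', 2100);
""")

# Number the rows within each ref by creation time, then join each row (a)
# to the row just before it (b) and keep pairs less than 60 seconds apart.
EARLIER_DUPLICATES = """
    WITH numbered AS (
        SELECT id, ref, created_at,
               ROW_NUMBER() OVER (PARTITION BY ref
                                  ORDER BY created_at) AS rn
        FROM valuation
    )
    SELECT b.id
    FROM numbered a
    JOIN numbered b
      ON a.ref = b.ref AND a.rn = b.rn + 1
    WHERE a.created_at - b.created_at < 60
    ORDER BY b.id
"""

earlier_ids = [row[0] for row in conn.execute(EARLIER_DUPLICATES)]
print(earlier_ids)
```

Only rows 1 and 2 are less than a minute apart, so the query returns the id of the earlier one, which is the row the question wants to flag. The column-by-column equality checks on the charge fields would go into the same WHERE clause.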
