Using a join in Mysql update statement instead of sub-query - mysql

I'm currently using the following to update a table of mine:
UPDATE Check_Dictionary
SET InDict = "No" WHERE (Leagues, Matches, Line) IN (SELECT * FROM (
SELECT Leagues, Matches, Line FROM Check_Dictionary
WHERE InDict = "No")as X)
However, when I have large data sets (40k+ rows) this seems to be fairly inefficient/slow. All of the searching I'm doing suggests that joins are far more efficient for this sort of thing than a sub-query. However, being a mysql newbie I'm not sure of the best way to do it.
My table may have multiple rows where the League/Matches/Line fields are the same. Generally the InDict field on these rows will be "Yes". However, if one of them is "No" I need to update all of the other rows with the same League/Matches/Line columns to "No" as well (so they all have a value of "No").
Would using a join in Mysql update statement instead of sub-query be more efficient?
How can I do it using a join?

I would expect a join to be faster, but that depends on indexing and other factors. You should try both and see which performs better (and perhaps use EXPLAIN to analyze the queries).
As for syntax, any of these should work:
UPDATE Check_Dictionary c1
JOIN (
SELECT Leagues, Matches, Line
FROM Check_Dictionary
WHERE InDict = "No"
) AS X USING (Leagues, Matches, Line)
SET InDict = "No";
UPDATE Check_Dictionary AS c1
JOIN Check_Dictionary AS c2 USING (Leagues, Matches, Line)
SET c1.InDict = "No"
WHERE c2.InDict = "No";
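To check that a rewritten statement still does what the question asks, it can help to run it against a tiny data set first. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the sample rows are made up, and because SQLite has no MySQL-style multi-table UPDATE the statement is written with a correlated EXISTS instead of a JOIN. It shows the expected end result, not the relative performance.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Check_Dictionary (Leagues TEXT, Matches TEXT, Line TEXT, InDict TEXT)"
)
# Made-up sample rows for illustration.
conn.executemany(
    "INSERT INTO Check_Dictionary VALUES (?, ?, ?, ?)",
    [
        ("L1", "M1", "A", "Yes"),
        ("L1", "M1", "A", "No"),   # one "No" in this group...
        ("L1", "M1", "A", "Yes"),  # ...so every row of the group must become "No"
        ("L2", "M2", "B", "Yes"),  # no "No" in this group, so it must stay untouched
    ],
)

# Correlated EXISTS: mark a row "No" when any row with the same key is "No".
conn.execute(
    """
    UPDATE Check_Dictionary
    SET InDict = 'No'
    WHERE EXISTS (
        SELECT 1
        FROM Check_Dictionary AS c2
        WHERE c2.Leagues = Check_Dictionary.Leagues
          AND c2.Matches = Check_Dictionary.Matches
          AND c2.Line = Check_Dictionary.Line
          AND c2.InDict = 'No'
    )
    """
)

rows = conn.execute(
    "SELECT Leagues, Matches, Line, InDict FROM Check_Dictionary ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

On MySQL itself you would run one of the JOIN statements above against a copy of the table and compare the rows in the same way.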

The UPDATE ... JOIN query given by "jpw" is correct and you can use it, so I won't repeat it. I just want to add that a join is usually faster than a subquery, especially when you are updating 40K+ rows. Here is what the MySQL documentation says on the subject:
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone. Prior to SQL-92, outer joins did not exist, so subqueries were the only way to do certain things. Today, MySQL Server and many other modern database systems offer a wide range of outer join types.
MySQL Server supports multiple-table DELETE statements that can be used to efficiently delete rows based on information from one table or even from many tables at the same time. Multiple-table UPDATE statements are also supported. See Section 13.2.2, “DELETE Syntax”, and Section 13.2.10, “UPDATE Syntax”.
Source : http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html

Related

How to optimize a MySQL update which contains an "in" subquery?

How do I optimize the following update because the sub-query is being executed for each row in table a?
update
a
set
col = 1
where
col_foreign_id not in (select col_foreign_id in b)
You could potentially use an outer join where there are no matching records instead of your not in:
update table1 a
left join table2 b on a.col_foreign_id = b.col_foreign_id
set a.col = 1
where b.col_foreign_id is null
This should use a simple select type rather than a dependent subquery.
Your current query (or the one that actually works since the example in the OP doesn't look like it would) is potentially dangerous in that a NULL in b.col_foreign_id would cause nothing to match, and you'd update no rows.
not exists would also be something to look at if you want to replace not in.
I can't tell you that this will make your query any faster, but there is some good info here. You'll have to test in your environment.
Here's a SQL Fiddle illuminating the differences between in, exists, and outer join (check the rows returned, null handling, and execution plans).
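The NULL pitfall mentioned above is easy to reproduce. This is a minimal sketch using Python's built-in sqlite3 module rather than MySQL (an assumption made so it runs anywhere); the three-valued NULL logic behind it is standard SQL. The table contents are made up, and SELECTs are used instead of UPDATEs so the matched rows are visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER PRIMARY KEY, col_foreign_id INTEGER, col INTEGER DEFAULT 0);
    CREATE TABLE b (col_foreign_id INTEGER);
    INSERT INTO a (id, col_foreign_id) VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO b (col_foreign_id) VALUES (10), (NULL);
    """
)


def ids(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]


# NOT IN: the NULL in b makes the comparison unknown for every non-matching row.
not_in = ids(
    "SELECT id FROM a "
    "WHERE col_foreign_id NOT IN (SELECT col_foreign_id FROM b) ORDER BY id"
)

# NOT EXISTS: the NULL row in b simply never matches.
not_exists = ids(
    "SELECT id FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.col_foreign_id = a.col_foreign_id) ORDER BY id"
)

# LEFT JOIN ... IS NULL: keeps the rows of a that found no partner in b.
left_join = ids(
    "SELECT a.id FROM a LEFT JOIN b ON a.col_foreign_id = b.col_foreign_id "
    "WHERE b.col_foreign_id IS NULL ORDER BY a.id"
)

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
print("LEFT JOIN: ", left_join)
```

With a NULL present in b, the NOT IN form selects nothing, while the other two forms select the rows of a that have no match.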

MySQL Query Efficiency Suggestion

Although this question is specific to MySQL, I wouldn't mind knowing if this answer applies to SQL engines in general.
Also, since this isn't a syntax question, I'm using pseudo-SQL for brevity/clarity.
Let's say C[1]..C[M] are a set of criteria (separated by AND or OR) and Q[1]..Q[N] are another set (separated by OR). I want to use C[1]...C[M] to filter a table and from this filtered table, I want all the rows matching Q[1]...Q[N].
If I were to do:
SELECT ... FROM ... WHERE (C[1]...C[M]) AND (Q[1]...Q[N])
Would this be automatically optimized so that C[1]...C[M] is evaluated only once and each Q[i] is run against this cached result? If not, should I then split the query into two like so:
INSERT INTO TEMP ... SELECT ... FROM ... WHERE C[1]...C[M]
SELECT ... FROM TEMP WHERE Q[1]...Q[N]
It is the job of the internal query optimizer to work out the best order in which to evaluate the joins and filters.
For instance in:
SELECT *
FROM
table1
INNER JOIN table2 ON table1.id = table2.id AND table2.column = Y
INNER JOIN table3 ON table3.id2 = table2.id2 AND table3.column = Z
WHERE
table1.column = X
MySQL (like Oracle, SQL Server, etc.) tries to work out the cheapest way to build each intermediate result set before running the query, and in a case like this the engine generally does a pretty good job.
However, everything relies on the statistics it has about the tables and on the indexes you have set up in your schema. These two points (apart from the data you load into the tables) are the only ones you can influence to help the optimizer make good decisions, by giving it accurate information.
I hope it helped.
PS: have a look at this. It is about operators and order of precedence in query compilation under Oracle, but it is probably a good thing to know anyway:
http://ezinearticles.com/?Oracle-SQL---The-Importance-of-Order-of-Precedence&id=1597846

MySQL optimization in combining 3 tables and search

Can anyone tell me what's wrong with this MySQL query?
select distinct(a.productId)
from product a
left join product_keyword b
on b.productId = a.productId
left join keywords c
on c.keywordId = b.keywordId
where a.productName LIKE '%truck%' OR c.value LIKE '%truck%'
limit 100;
I need to join 3 tables (product, product_keyword and keywords) and search based on user input. One product can have multiple keywords, and I store the link in product_keyword (using the keywordId from the keywords table).
Can anyone help me please?
When you use the % wildcard in the prefix of a LIKE search, MySQL isn't able to utilize any indexes for the search. Instead, MySQL must scan all of the rows.
You should at least have indexes on the join columns (productID and keywordID) so that MySQL is able to more quickly perform the join operations. However, if the result set is too large, MySQL will perform a scan for the JOINs as well.
Most likely, MySQL is scanning each row in product, then performing the JOIN to product_keyword, then performing the join to keywords. Then, it checks to see if it can exclude the row based on the WHERE clause. Once it returns 100 rows, it stops.
If your tables are large, this will be a very expensive query. Using a leading wildcard on a LIKE query will usually be very slow. If you need that sort of search capability, it is probably better to do it externally in Lucene or something similar, rather than in the database.
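The effect of a leading wildcard can be seen in a query plan. The sketch below uses Python's built-in sqlite3 module, not MySQL, so the plan text and the exact optimization rules are SQLite's (the NOCASE collation is there only because SQLite needs it to use an index for its case-insensitive LIKE). The index name is made up. On MySQL the equivalent check would be to run EXPLAIN on both queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (
        productId INTEGER PRIMARY KEY,
        productName TEXT COLLATE NOCASE
    );
    -- Hypothetical index name, for illustration only.
    CREATE INDEX idx_product_name ON product (productName);
    """
)


def plan(sql):
    """Return SQLite's query plan for sql as one string (detail column only)."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Leading wildcard: the index cannot narrow the range, every entry is visited.
leading = plan("SELECT productId FROM product WHERE productName LIKE '%truck%'")

# Prefix pattern: the index can be searched as a range on productName.
prefix = plan("SELECT productId FROM product WHERE productName LIKE 'truck%'")

print("leading wildcard:", leading)
print("prefix pattern:  ", prefix)
```

The exact wording of the plan differs between SQLite versions, but the leading-wildcard query is reported as a scan and the prefix query as an index search.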

Putting together a SQL Stored Proc

So I have a couple SQL commands that I basically want to make a proc, but while doing this, I'd like to optimize them a little bit more.
The first part of it is this:
select tr_reference_nbr
from cfo_daily_trans_hist
inner join cfo_fas157_valuation on fv_dh_daily_trans_hist_id = dh_daily_trans_hist_id
inner join cfo_tran_quote on tq_tran_quote_id = dh_tq_tran_quote_id
inner join cfo_transaction on tq_tr_transaction_id = tr_transaction_id
inner join cfo_fas157_project_valuation ON fpv_fas157_project_valuation_id = fv_fpv_fas157_project_valuation_id AND fpv_status_bit = 1
group by tr_reference_nbr, fv_dh_daily_trans_hist_id
having count(*)>1
This query returns to me which tr_reference_nbr's exist that have duplicate data in our system, which needs to be removed. After this is run, I run this other query, copying and pasting in the tr_reference_nbr one at a time that the above query gave me:
select
tr_reference_nbr , dh_daily_trans_hist_id ,cfo_fas157_project_valuation.*,
cfo_daily_trans_hist.* ,
cfo_fas157_valuation.*
from cfo_daily_trans_hist
inner join cfo_fas157_valuation on fv_dh_daily_trans_hist_id = dh_daily_trans_hist_id
inner join cfo_tran_quote on tq_tran_quote_id = dh_tq_tran_quote_id
inner join cfo_transaction on tq_tr_transaction_id = tr_transaction_id
inner join cfo_fas157_project_valuation ON fpv_fas157_project_valuation_id = fv_fpv_fas157_project_valuation_id
where
tr_reference_nbr in
(
[PASTEDREFERENCENUMBER]
)
and fpv_status_bit = 1
order by dh_val_time_stamp desc
Now this query gives me a bunch of records for that specific tr_reference_nbr. I then have to look through this data and find the rows that have a matching (duplicate) dh_daily_trans_hist_id. Once this is found, I look and make sure that the following columns also match for that row so I know they are true duplicates: fpv_unadjusted_sponsor_charge, fpv_adjusted_sponsor_charge, fpv_unadjusted_counterparty_charge, and fpv_adjusted_counterparty_charge.
If THOSE all match, I then look at yet another column, fv_create_dt, and make sure that there is less than a minute's difference between the two timestamps there. If there is, I run yet another query on the row that was stored EARLIER, which looks like this:
begin tran
update cfo_fas157_valuation set fpv_status_bit = 0 where fpv_fas157_project_valuation_id = [IDRECIEVEDFROMTHEOTHERTABLE]
commit
As you can see, this is still a very manual process even though we do have a few queries written, but I'm trying to find a solution to where we can just run one query, and it would basically do EVERYTHING except for the final query. So basically something that would provide to us a few fpv_fas157_project_valuation_id's that need to be updated.
From looking at these queries, do any of you guys see an easy way to combine all this? I've been working on it all day and can't seem to get something to run. I feel like I keep screwing up the joins and stuff.
Thanks!
You can combine these queries in multiple ways:
use temporary tables to store results of queries - suitable for stored procedure
use table variables to store results of queries - suitable for stored procedure
use Common Table Expressions (CTEs) to store results of queries - suitable for single query
Once you have them in separate tables/variables/CTEs you can easily join them.
Then you have to do one more thing, which is to find the difference in datetime between two consecutive rows. There is a trick to do this:
use ROW_NUMBER() to add a column with number of row partitioned by grouping fields (tr_reference_nbr, ... ) ordered by fv_create_dt
do a self join on A.ROW_NUMBER = B.ROW_NUMBER + 1
check the difference between A.fv_create_dt and B.fv_create_dt to filter the rows with difference less than a minute
Test your self-join thoroughly to make sure you filter only the rows you need to filter.
If you still have problems with this, don't hesitate to leave a comment.
Interesting note: SQL Server Denali (released as SQL Server 2012) adds the T-SQL functions LEAD and LAG, which access the following and previous rows without a self-join.
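The ROW_NUMBER() self-join trick described above can be sketched as follows. This is an illustration only: it runs on Python's built-in sqlite3 module (window functions need SQLite 3.25 or later) instead of SQL Server, the valuations table is a hypothetical, simplified stand-in for the joined result set in the question, and the sample rows are made up. The T-SQL version would use DATEDIFF for the time comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified stand-in for the joined result set.
    CREATE TABLE valuations (
        valuation_id INTEGER PRIMARY KEY,
        tr_reference_nbr TEXT,
        dh_daily_trans_hist_id INTEGER,
        fv_create_dt TEXT
    );
    INSERT INTO valuations VALUES
        (1, 'REF1', 100, '2011-01-01 10:00:00'),
        (2, 'REF1', 100, '2011-01-01 10:00:30'),  -- 30 s after row 1, a duplicate pair
        (3, 'REF1', 100, '2011-01-01 12:00:00'),  -- two hours later, not a duplicate
        (4, 'REF2', 200, '2011-01-01 10:00:00');  -- alone in its group
    """
)

# Number the rows inside each group by creation time, then join each row (a)
# to the row just before it (b) and keep pairs created less than 60 s apart.
earlier_ids = [
    row[0]
    for row in conn.execute(
        """
        WITH numbered AS (
            SELECT v.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY tr_reference_nbr, dh_daily_trans_hist_id
                       ORDER BY fv_create_dt
                   ) AS rn
            FROM valuations AS v
        )
        SELECT b.valuation_id
        FROM numbered AS a
        JOIN numbered AS b
          ON a.tr_reference_nbr = b.tr_reference_nbr
         AND a.dh_daily_trans_hist_id = b.dh_daily_trans_hist_id
         AND a.rn = b.rn + 1
        WHERE strftime('%s', a.fv_create_dt) - strftime('%s', b.fv_create_dt) < 60
        ORDER BY b.valuation_id
        """
    )
]

print("ids of the earlier row in each duplicate pair:", earlier_ids)
```

The query returns the id of the EARLIER row of each close pair, which is the row the question wants to deactivate. The extra column comparisons from the question (the four charge columns) would go into the same WHERE clause.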

which query is better and efficient - mysql

I have come across the same query written in different ways, as shown below
Type-I
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS
INNER JOIN Countries C
ON JS.CountryID = C.CountryID
INNER JOIN SalaryBracket S
ON JS.MinSalaryID = S.SalaryID
INNER JOIN DegreeDisciplines DD
ON JS.DegreeDisciplineID = DD.DisciplineID
INNER JOIN DegreeType DT
ON JS.DegreeTypeID = DT.DegreeTypeID
WHERE
JS.ShowCV = 'Yes'
Type-II
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS, Countries C, SalaryBracket S, DegreeDisciplines DD
, DegreeType DT
WHERE
JS.CountryID = C.CountryID
AND JS.MinSalaryID = S.SalaryID
AND JS.DegreeDisciplineID = DD.DisciplineID
AND JS.DegreeTypeID = DT.DegreeTypeID
AND JS.ShowCV = 'Yes'
I am using a MySQL database.
Both work really well, but I am wondering:
Which is the best practice to use all the time, in any situation?
Performance-wise, which one is better? (Say the database has millions of records.)
Are there any advantages of one over the other?
Is there any tool with which I can check which query is better?
Thanks in advance
1- It's a no-brainer: use Type-I.
2- Type-II joins are also called 'implicit joins', whereas Type-I joins are called 'explicit joins'. With a modern DBMS you will not have any performance problem with ordinary queries. But I think that with some big, complex multi-join queries the DBMS could have trouble with implicit joins. Using only explicit joins could improve your execution plan, and so give faster results!
3- So performance could be an issue, but perhaps more importantly, readability is improved, which helps later maintenance. An explicit join states exactly what you are joining and on which fields, whereas an implicit join doesn't show whether a condition is a join or a filter. The WHERE clause is for filtering, not for joining!
And a big point in favour of explicit joins: outer joins are really awkward to write with implicit joins. Multiple joins mixed with outer joins become so hard to read that explicit joins are THE solution.
4- Execution plans are what you need (see the doc).
Some duplicates :
Explicit vs implicit SQL joins
SQL join: where clause vs. on clause
INNER JOIN ON vs WHERE clause
In most of the code I've seen, such queries are written like your Type-II, but I think Type-I is better because of readability (and it is more logical: a join is a join, so you should write it as a join, although the second form is just another way of writing inner joins).
In terms of performance there shouldn't be a difference (if there is one, I think Type-I would be a bit faster).
Look at "Explain"-syntax
http://dev.mysql.com/doc/refman/5.1/en/explain.html
My suggestion:
Populate all your tables with a reasonable number of records. Open the MySQL console and run both SQL statements one after the other. The console shows the execution time of each.
For the two queries you mentioned (each with only inner joins) any modern database's query optimizer should produce exactly the same query plan, and thus the same performance.
For MySQL, if you prefix the query with EXPLAIN, it will spit out information about the query plan (instead of running the query). If the information from both queries is the same, then the query plan is the same, and the performance will be identical. From the MySQL Reference Manual:
EXPLAIN returns a row of information for each table used in the SELECT statement. The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process.
As to which syntax is better? That's up to you, but once you move beyond inner joins to outer joins, you'll need to use the newer syntax, since there's no standard for describing outer joins using the older implicit join syntax.
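The "compare the plans" advice can be tried out on a small scale. The sketch below is not MySQL: it uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, with a cut-down, made-up version of the Jobseekers and Countries tables, purely to show the workflow of running both join styles and comparing rows and plans. On MySQL you would prefix each query with EXPLAIN and compare the output in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Cut-down versions of two tables from the question, with made-up rows.
    CREATE TABLE Countries (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
    CREATE TABLE Jobseekers (
        JobseekerID INTEGER PRIMARY KEY,
        FirstName TEXT,
        CountryID INTEGER,
        ShowCV TEXT
    );
    INSERT INTO Countries VALUES (1, 'India'), (2, 'France');
    INSERT INTO Jobseekers VALUES
        (1, 'Asha', 1, 'Yes'),
        (2, 'Luc', 2, 'No'),
        (3, 'Ravi', 1, 'Yes');
    """
)

# Type-I: explicit join.
explicit_sql = """
    SELECT JS.JobseekerID, JS.FirstName, C.CountryName
    FROM Jobseekers JS
    INNER JOIN Countries C ON JS.CountryID = C.CountryID
    WHERE JS.ShowCV = 'Yes'
"""

# Type-II: implicit join.
implicit_sql = """
    SELECT JS.JobseekerID, JS.FirstName, C.CountryName
    FROM Jobseekers JS, Countries C
    WHERE JS.CountryID = C.CountryID
      AND JS.ShowCV = 'Yes'
"""


def rows(sql):
    """Run the query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


def plan(sql):
    """Return the detail column of SQLite's query plan, one entry per step."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


same_rows = rows(explicit_sql) == rows(implicit_sql)
same_plan = plan(explicit_sql) == plan(implicit_sql)
print("same rows:", same_rows)
print("same plan:", same_plan)
```

If the two plans match, the optimizer treats both spellings alike, and the choice between them comes down to readability.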