MySQL: optimize a query to enhance performance for a comparison report

I have two tables, "users" and "temp_users". The "users" table contains millions of rows and "temp_users" contains thousands. Both tables hold the same sort of information, but sometimes a record might be missing.
The requirement is to compare these two tables and show the differences between them. I wrote a comparison query but, maybe due to the huge volume of data (millions of rows), it takes more than 5 minutes to execute. Any suggestions?
The comparison query I wrote is below:
SELECT
id,
dateTime,
phone,
address
FROM
tempUsers t1
WHERE NOT EXISTS (
SELECT id,dateTime
FROM users t2
WHERE t1.id = t2.id
OR t1.dateTime=t2.dateTime
)
The system is developed in JSP and MySQL and is deployed in Apache Tomcat
Thanks,

Two Observations:
Did you really intend to have an 'OR' in your WHERE clause? Shouldn't it be an 'AND'? An 'OR' can make a query run much slower if the query optimizer is unable to use indexes because of the 'OR' logic.
You are using a sub-select rather than a JOIN. Because the sub-select refers to the outer row, it is a 'correlated subquery', which may have to be executed once for every row of the outer select.
The two issues above (a correlated subquery with an OR condition) are likely what is causing the problem.
Try the following query instead:
SELECT
t1.id,
t1.dateTime,
t1.phone,
t1.address
FROM
tempUsers t1
LEFT OUTER JOIN
users t2
ON
t1.id = t2.id
AND t1.dateTime=t2.dateTime
WHERE
t2.id IS NULL
The above query performs a 'LEFT OUTER JOIN' on id and dateTime to join the two tables, then filters the results to only those rows of tempUsers for which there is no matching row in users. This should return what you want.
If the 'OR' condition really is the logic you need, then change it in the 'ON' clause, but be prepared for it to adversely affect the speed of the query.
For additional speed: ensure that the users table has an index on 'id', on 'dateTime', or on both.
Hope this helps!
john...
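To see how the 'AND' join and the original 'OR' condition differ, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere); the sample rows and the index name idx_users_id_datetime are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE users     (id INTEGER, dateTime TEXT, phone TEXT, address TEXT);
    CREATE TABLE tempUsers (id INTEGER, dateTime TEXT, phone TEXT, address TEXT);
    -- Hypothetical composite index covering both join columns.
    CREATE INDEX idx_users_id_datetime ON users (id, dateTime);

    INSERT INTO users VALUES
        (1, '2020-01-01 10:00:00', '111', 'A St'),
        (2, '2020-01-02 10:00:00', '222', 'B St');
    INSERT INTO tempUsers VALUES
        (1, '2020-01-01 10:00:00', '111', 'A St'),   -- matches on id and dateTime
        (2, '2020-01-09 09:00:00', '222', 'B St'),   -- matches on id only
        (3, '2020-01-03 10:00:00', '333', 'C St');   -- matches on nothing
""")

# Anti-join with AND: keep rows that have no users row matching BOTH columns.
and_ids = [r[0] for r in conn.execute("""
    SELECT t1.id
    FROM tempUsers t1
    LEFT OUTER JOIN users t2
      ON t1.id = t2.id AND t1.dateTime = t2.dateTime
    WHERE t2.id IS NULL
    ORDER BY t1.id
""")]

# Original NOT EXISTS with OR: keep rows that match on NEITHER column.
or_ids = [r[0] for r in conn.execute("""
    SELECT t1.id
    FROM tempUsers t1
    WHERE NOT EXISTS (
        SELECT 1 FROM users t2
        WHERE t1.id = t2.id OR t1.dateTime = t2.dateTime
    )
    ORDER BY t1.id
""")]

print(and_ids)  # [2, 3]
print(or_ids)   # [3]
```

The row that matches on id but not on dateTime is reported as a difference by the 'AND' version only, so which form is right depends on what counts as "the same record" in the report.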

Related

Does adding a join condition on two different tables (excluding the table to be joined) slow down the query and performance?

I have 3 tables in MySQL (table1, table2 and table3), and each of them is large (>100k rows).
My join is:
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns its result very slowly, and I believe the issue is the second join condition: if I remove it, the query returns instantly.
Can anyone please explain why the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join condition refers to both of the other tables, it adds more checks during evaluation. The joins form a triangle rather than a simple chain.
Also, since all three tables have ~100K rows, even with indexes on the given columns there is bound to be a performance hit, partly because all columns are being retrieved.
At the very least, write the select list as T1.col1, T1.col2, ..., T2.col1, ... and so on.
Also have an index on each column used in a join condition.
Moreover, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table: the server can then filter each table first (100K rows may become 10K) before attempting the join.
Also check the EXPLAIN output to see whether a full table scan is being used (most probably yes); if so, getting the query to use an index should improve the situation.
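The index suggestions above can be tried on a toy version of the schema. This is only a sketch: it runs on Python's built-in sqlite3 module rather than MySQL (an assumption for portability), the column types, sample rows and index names are made up, and SQLite's EXPLAIN QUERY PLAN stands in for MySQL's EXPLAIN, whose output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col3 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, col2 INTEGER);
    CREATE TABLE table3 (col2 INTEGER, col3 INTEGER);
    -- Hypothetical index names; the indexed columns follow the suggestion above.
    CREATE INDEX idx_table2_col1 ON table2 (col1);
    CREATE INDEX idx_table3_col2_col3 ON table3 (col2, col3);
    INSERT INTO table1 VALUES (1, 10), (2, 20);
    INSERT INTO table2 VALUES (1, 100), (2, 200);
    INSERT INTO table3 VALUES (100, 10), (200, 99);
""")

# Named columns instead of SELECT *, same join conditions as in the question.
query = """
    SELECT t1.col1, t2.col2, t3.col3
    FROM table1 t1
    JOIN table2 t2 ON t1.col1 = t2.col1
    JOIN table3 t3 ON t3.col2 = t2.col2 AND t3.col3 = t1.col3
"""
rows = conn.execute(query).fetchall()

# The last column of each plan row is the human-readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(rows)
for step in plan:
    print(step)
```

On a real MySQL server the equivalent check is to prefix the query with EXPLAIN and look at which key, if any, is chosen for each table.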

Does a multiple-table join slow down MySQL?

My simple question is: does joining multiple tables slow down MySQL performance?
I have a data set where I need to JOIN about 6 tables, on properly indexed columns.
I have read threads like
Join slows down sql
MySQL adding join slows down whole query
MySQL multiple table join query performance issue
But the question still remains.
Can someone who has experienced this reply?
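MySQL has traditionally executed joins with nested-loop algorithms: an index lookup per row where a usable index exists, and Block Nested-Loop where it does not (MySQL 8.0.18 and later can use hash joins instead).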
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2
ON t2.id = t1.id
In effect, this join yields the same performance as a subquery like the following:
SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
FROM table1 t1
Indexes are obviously important to satisfy the WHERE clause in the subquery, and are used in the same fashion for join operations.
The performance of a join, assuming proper indexes, amounts to the number of lookups that MySQL must perform. The more lookups, the longer it takes.
Hence, the more rows involved, the slower the join. Joins with small result sets (few rows) are fast and considered normal usage. Keep your result sets small and use proper indexes, and you'll be fine. Don't avoid the join.
Of course, sorting results from multiple tables can be a bit more complicated for MySQL, and any time you join text or blob columns MySQL requires a temporary table, and there are numerous other details.
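As a small illustration of the comparison above, the LEFT JOIN and the scalar subquery return the same rows as long as table2.id is unique. The sketch below assumes that, and uses Python's built-in sqlite3 module in place of MySQL; the name column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'z');
""")

# One lookup into table2 per table1 row, expressed as a join...
join_rows = conn.execute("""
    SELECT t1.*, t2.col1
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.id = t1.id
    ORDER BY t1.id
""").fetchall()

# ...and the same lookup expressed as a correlated scalar subquery.
subquery_rows = conn.execute("""
    SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
    FROM table1 t1
    ORDER BY t1.id
""").fetchall()

print(join_rows)      # [(1, 'a', 'x'), (2, 'b', None), (3, 'c', 'z')]
print(subquery_rows)  # same rows
```

Either way the work is one indexed lookup per outer row, which is why the number of rows involved matters more than the mere presence of a join.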

Very slow MySQL subquery

I have two tables that each contain about 500 customer data records. Each record in each of the tables has an email field. Sometimes the same email addresses exist on both tables, sometimes not. I want to retrieve every email address on table1 that doesn't exist on table2. The email field in each table is indexed. I'm doing the select with a sub query that is really slow, 10 to 20 seconds.
select email
from
t1
where
email not in (select email from t2)
There's actually about 30K rows in each table, but I can knock it down to 500 each very quickly with an additional 'where' to filter by category. It's only when I add that subquery that it slows down dramatically. So, I am sure this can be faster, and I know a join should be much faster than the subquery, but can't figure out how to do that. I found a left outer join explanation here on SO, that looked like it should help, but got nowhere with it. Any help is appreciated.
Older MySQL versions (before 5.6) do not optimize a subquery in the WHERE clause (edit: they re-run the subquery for every row tested).
To convert to a JOIN, try something like:
SELECT t1.email FROM t1
LEFT JOIN t2 ON (t1.email = t2.email)
WHERE t2.email IS NULL
This should run very fast, as a covering index query.
The query optimizer should walk the email index of t1, check the email index of t2, and output those emails that are in t1 but not in t2.
Edit: I should add, MySQL does optimize a subquery in the JOIN clause: it runs the subquery once, puts the results into a "derived table" (a temporary table without any indexes), and joins the derived table like any other. The syntax is a bit funny: each derived table must have an alias, i.e. ... JOIN (SELECT ...) AS derived ON ....
Subqueries usually need more processing than a plain query. In your case it first fetches all the emails from t2 and then compares them with the email list of t1.
You can try the following, without using a subquery:
SELECT t1.email FROM t1 LEFT JOIN t2 ON t1.email = t2.email WHERE t2.email IS NULL
The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. The index entries act like pointers to the table rows, allowing the query to quickly determine which rows match a condition in the WHERE clause, and retrieve the other column values for those rows. All MySQL data types can be indexed.
some tricks for creating mysql tables ..
see this.
I think this should work fine:
SELECT T1.email FROM T1
LEFT JOIN T2
ON T1.email = T2.email
WHERE T2.email IS NULL
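The NULL test in an anti-join like this has to be written with IS NULL, because comparing anything to NULL with = or != evaluates to NULL rather than true. A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; NULL comparison behaves the same way in both), with invented sample addresses and index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE t1 (email TEXT);
    CREATE TABLE t2 (email TEXT);
    -- Hypothetical index names; the question states both email columns are indexed.
    CREATE INDEX idx_t1_email ON t1 (email);
    CREATE INDEX idx_t2_email ON t2 (email);
    INSERT INTO t1 VALUES ('a@example.com'), ('b@example.com'), ('c@example.com');
    INSERT INTO t2 VALUES ('b@example.com');
""")

# Same anti-join twice; only the NULL test in the WHERE clause changes.
anti_join = """
    SELECT t1.email
    FROM t1
    LEFT JOIN t2 ON t1.email = t2.email
    WHERE t2.email {null_test}
    ORDER BY t1.email
"""
with_is_null = [r[0] for r in conn.execute(anti_join.format(null_test="IS NULL"))]
with_not_equal = [r[0] for r in conn.execute(anti_join.format(null_test="!= NULL"))]

print(with_is_null)    # ['a@example.com', 'c@example.com']
print(with_not_equal)  # []
```

Only the IS NULL form returns the addresses that are missing from t2; the != NULL form filters out every row.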

How to make SQL query faster?

I have a big DB, with about 1 million rows. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it runs very slowly. I know that I can do it using "JOIN" syntax, but I can't work out how.
Try this way:
select t1.*
from t1
left join t2 on t1.id1 = t2.id2
where t2.id2 is null
First of all you should optimize your indexes in both tables, and after that you should use a join.
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all t1 where id1 is not in that set. You suggest this using the IN clause.
It can select record by record from t1 and look for each record if it finds a match in t2. You would suggest this using the EXISTS clause.
You can outer join the table then throw away all matches and stay with the non-matching entries. This may look like a bad way, especially when there are many matches, because you would get big intermediate data and then throw most of it away. However, depending on how the dbms works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc. and on what the dbms makes of your query. There are dbms that are able to completely re-write your query to find the best execution plan.
Having said all this, you can just try different things:
the IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use an EXISTS clause and see if that is faster
the outer join suggested by Parado
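The three variants listed above can be written side by side as follows. This is a sketch under a few assumptions: it runs on Python's built-in sqlite3 module instead of MySQL, the payload column, sample rows and index name are hypothetical, and id2 is declared NOT NULL (NOT IN behaves differently when the subquery can return NULL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE t1 (id1 INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE t2 (id2 INTEGER NOT NULL);
    CREATE INDEX idx_t2_id2 ON t2 (id2);  -- hypothetical index name
    INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four');
    INSERT INTO t2 VALUES (2), (2), (4);  -- duplicates on purpose
""")

queries = {
    # IN clause, with DISTINCT to shrink the intermediate set.
    "not_in": """
        SELECT * FROM t1
        WHERE id1 NOT IN (SELECT DISTINCT id2 FROM t2)
        ORDER BY id1
    """,
    # EXISTS clause: probe t2 once per t1 row.
    "not_exists": """
        SELECT * FROM t1
        WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id2 = t1.id1)
        ORDER BY id1
    """,
    # Outer join, then discard every row that found a match.
    "left_join": """
        SELECT t1.* FROM t1
        LEFT JOIN t2 ON t1.id1 = t2.id2
        WHERE t2.id2 IS NULL
        ORDER BY t1.id1
    """,
}
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)  # each variant yields [(1, 'one'), (3, 'three')]
```

All three return the same rows here, so the choice comes down to which one the DBMS executes fastest on the real data, and that is best settled by measuring.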

JOIN or INNER SELECT with IN, which is faster?

I was wondering which is faster an INNER JOIN or INNER SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is key, these queries mean the same thing, and a decent DBMS will execute them in the exact same way. Unfortunately MySQL doesn't, as can be seen by expanding the "View Execution Plan" link in this SQL Fiddle. Which one will be faster probably depends on the size of tables - if TABLE1 has very few rows, then IN has a chance for being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
If all else is equal, use the INNER JOIN mostly because it's more predictable and thus easier to understand to a new developer coming in. But of course if you see a real advantage to the IN (SELECT ...) form, use it!
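A quick way to follow this advice is to run both forms and ask the engine for its plan. The sketch below does that with Python's built-in sqlite3 module, where EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN (an assumption for portability; the output format differs). The val column, the sample rows and the query_plan helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.executescript("""
    CREATE TABLE test1 (id TEXT PRIMARY KEY, val INTEGER);
    CREATE TABLE test2 (id TEXT PRIMARY KEY);
    INSERT INTO test1 VALUES ('blah', 1), ('other', 2);
    INSERT INTO test2 VALUES ('blah'), ('unused');
""")

join_sql = """
    SELECT t1.* FROM test1 t1
    INNER JOIN test2 t2 ON t1.id = t2.id
    WHERE t2.id = 'blah'
"""
in_sql = """
    SELECT t1.* FROM test1 t1
    WHERE t1.id IN (SELECT t2.id FROM test2 t2 WHERE t2.id = 'blah')
"""


def query_plan(sql):
    """Return the textual plan steps the engine reports for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


join_rows = conn.execute(join_sql).fetchall()
in_rows = conn.execute(in_sql).fetchall()

print(join_rows, in_rows)  # [('blah', 1)] [('blah', 1)]
for label, sql in (("JOIN", join_sql), ("IN", in_sql)):
    print(label, query_plan(sql))
```

Because id is a key in both tables, the two queries return identical rows; whether the plans also match is exactly what the EXPLAIN output tells you for your own server and data.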
Though you'd have to check the execution plan on whatever RDBMS you're inquiring about, I would guess the inner join would be faster or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario since it has statistics on each table, however at the very best case, the inner join would be the same.
In most cases a JOIN is much faster than a subquery, but a subquery is more readable than a JOIN.
The RDBMS creates an execution plan for a JOIN, so it can predict which data needs to be loaded for processing. This saves time. For a subquery, on the other hand, it may run all the queries and load all their data to do the processing.
For more details please check this link.