I have a query that gets data by joining 3 big tables (~1 million records each); in addition, they are very busy tables.
Is it better to do traditional joins, or to first fetch values from the first table and then run a second query, passing the retrieved values as a comma-delimited IN clause?
Option #1
SELECT *
FROM BigTable1 a
INNER JOIN BigTable2 b using(someField2)
INNER JOIN BigTable3 c using(someField3)
WHERE a.someField1 = 'value'
vs
Option #2
$values = SELECT someField2 FROM BigTable1 WHERE someField1 = 'value'; #(~20-200 values)
SELECT *
FROM BigTable2
INNER JOIN BigTable3 c using(someField3)
WHERE someField2 in ($values)
Option #3
Create a temp table to store these values from BigTable1
and use it instead of joining to BigTable1 directly.
Any other options?
I think the best option is to try both approaches and run explain on them.
One optimization you could make would be to use a stored procedure for the second approach, which would reduce the overhead of having to run 2 queries from the client.
Finally, joining is quite an expensive operation for very large tables, since without usable indexes you're essentially projecting and selecting over 1M x 1M rows. (Terms: What are projection and selection?)
There is no definitive answer to your question and you could profile both ways since they depend on multiple factors.
However, the first approach is usually taken and should be faster if all of the tables are correctly indexed and the sizes of the rows are "standard".
Also take into account that in the second approach the latency of the network communication will be far worse since you will need multiple trips to the DB.
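As a rough illustration of Option #1 versus Option #2, here is a minimal sketch that runs both against a tiny in-memory database. It uses Python's sqlite3 module as a stand-in for MySQL, and the sample rows, the payload column, and the index names are assumptions made up for the example. It only shows that the two formulations return the same rows; timing them on the real tables (and reading EXPLAIN) is still the way to decide.

```python
import sqlite3

# SQLite stands in for MySQL; the data, payload column and index names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BigTable1 (someField1 TEXT, someField2 INTEGER);
CREATE TABLE BigTable2 (someField2 INTEGER, someField3 INTEGER);
CREATE TABLE BigTable3 (someField3 INTEGER, payload TEXT);
CREATE INDEX idx_t1 ON BigTable1 (someField1, someField2);
CREATE INDEX idx_t2 ON BigTable2 (someField2);
CREATE INDEX idx_t3 ON BigTable3 (someField3);
INSERT INTO BigTable1 VALUES ('value', 1), ('value', 2), ('other', 3);
INSERT INTO BigTable2 VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO BigTable3 VALUES (10, 'a'), (20, 'b'), (30, 'c');
""")

# Option #1: a single query with a three-way join.
joined = conn.execute("""
    SELECT b.someField2, c.payload
    FROM BigTable1 a
    INNER JOIN BigTable2 b ON b.someField2 = a.someField2
    INNER JOIN BigTable3 c ON c.someField3 = b.someField3
    WHERE a.someField1 = 'value'
    ORDER BY b.someField2
""").fetchall()

# Option #2: fetch the keys first, then pass them as a parameterised IN list.
values = [row[0] for row in conn.execute(
    "SELECT someField2 FROM BigTable1 WHERE someField1 = ?", ("value",))]
placeholders = ",".join("?" * len(values))
two_step = conn.execute(f"""
    SELECT b.someField2, c.payload
    FROM BigTable2 b
    INNER JOIN BigTable3 c ON c.someField3 = b.someField3
    WHERE b.someField2 IN ({placeholders})
    ORDER BY b.someField2
""", values).fetchall()

print(joined == two_step)
```

Note that the two options stop being equivalent if BigTable1 contains the same someField2 more than once for the filter value: the join repeats rows, while the IN list does not.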
Related
I'm a beginner in MySQL. I have written a query using LEFT JOIN to get the columns mentioned in the query, and I want to convert it to use subqueries. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'
In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
Size and nature of your actual data will affect performance so to know for sure you are best to test both options and measure results. However beware that the optimal query can potentially switch around as your tables grow.
SELECT b.service_status,
(SELECT s.b2b_acpt_flag FROM b2b_status AS s INNER JOIN b2b_booking_tbl AS t ON t.b2b_booking_id = s.b2b_booking_id WHERE t.gb_booking_id = b.booking_id) AS b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple matching rows, the extra queries multiply and this could become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, hopefully matching on a primary key. So 3 extra queries cost very little, as long as the number of rows is low.
A join would run as a single query across the indexed tables, which will often shave a few milliseconds off.
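To make the comparison concrete, here is a small sketch showing that a LEFT JOIN and per-row scalar subqueries return the same rows. It uses Python's sqlite3 as a stand-in for MySQL, keeps only two of the three tables, and the sample data and index name are assumptions for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows and the idx_gb index are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_booking_tb (booking_id INTEGER PRIMARY KEY, service_status TEXT);
CREATE TABLE b2b_booking_tbl (b2b_booking_id INTEGER PRIMARY KEY, gb_booking_id INTEGER,
                              b2b_check_in_report TEXT, b2b_swap_flag INTEGER);
CREATE INDEX idx_gb ON b2b_booking_tbl (gb_booking_id);
INSERT INTO user_booking_tb VALUES (1, 'done'), (2, 'open');
INSERT INTO b2b_booking_tbl VALUES (100, 1, 'report-1', 0);
""")

# One query: the join produces NULLs where no b2b row exists.
join_rows = conn.execute("""
    SELECT b.booking_id, b.service_status, t.b2b_check_in_report, t.b2b_swap_flag
    FROM user_booking_tb AS b
    LEFT JOIN b2b_booking_tbl AS t ON t.gb_booking_id = b.booking_id
    ORDER BY b.booking_id
""").fetchall()

# Same result, but each output row triggers two extra lookups.
subquery_rows = conn.execute("""
    SELECT b.booking_id, b.service_status,
           (SELECT t.b2b_check_in_report FROM b2b_booking_tbl AS t
             WHERE t.gb_booking_id = b.booking_id) AS b2b_check_in_report,
           (SELECT t.b2b_swap_flag FROM b2b_booking_tbl AS t
             WHERE t.gb_booking_id = b.booking_id) AS b2b_swap_flag
    FROM user_booking_tb AS b
    ORDER BY b.booking_id
""").fetchall()

print(join_rows == subquery_rows)
```

The equivalence only holds while each booking matches at most one b2b row; with several matches the join returns several rows, whereas a scalar subquery that returns more than one row is an error in MySQL.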
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
How large the typical list of booking_ids is will affect which approach is more efficient.
Both SQL versions return the same results. In the first, the join is on a subquery; in the second, the final query joins a temporary table that I create and populate beforehand.
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN (
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa
) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa;
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
Which will have better performance for a few thousand records?
The two formulations are identical except that your explicit temp table version is 3 sql statements instead of just 1. That is, the overhead of the back and forth to the server makes it slower. But...
Since the implicit temp table is in a LEFT JOIN, that subquery may be evaluated in one of two ways...
Older versions of MySQL were 'dumb' and re-evaluated it. Hence slow.
Newer versions automatically create an index. Hence fast.
Meanwhile, you could speed up the explicit temp table version by adding a suitable index. It would be PRIMARY KEY(collegiate_id). If there is a chance of that INNER JOIN producing dups, then say SELECT DISTINCT.
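The indexed temp table suggested above could look roughly like this. It is a sketch under assumptions: Python's sqlite3 stands in for MySQL, the tables are cut down to the columns needed, the sample rows are invented, and the literal 'remesa' replaces the remesa variable from the question.

```python
import sqlite3

# SQLite stands in for MySQL; schema is trimmed and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collegiates (collegiate_id INTEGER PRIMARY KEY, active INTEGER);
CREATE TABLE remittances (remittance_id INTEGER PRIMARY KEY, type_id INTEGER, name TEXT);
CREATE TABLE collegiateRemittances (collegiate_id INTEGER, remittance_id INTEGER);
INSERT INTO collegiates VALUES (1, 1), (2, 1), (3, 1);
INSERT INTO remittances VALUES (10, 1, 'remesa'), (11, 1, 'remesa');
INSERT INTO collegiateRemittances VALUES (1, 10), (1, 11), (2, 10);

-- Explicit temp table with a primary key; DISTINCT keeps the key unique
-- even though collegiate 1 appears in two remittances.
CREATE TEMPORARY TABLE hasRemittance (collegiate_id INTEGER PRIMARY KEY);
INSERT INTO hasRemittance
SELECT DISTINCT r.collegiate_id
FROM collegiateRemittances r
INNER JOIN remittances r1 ON r1.remittance_id = r.remittance_id
WHERE r1.type_id = 1 AND r1.name = 'remesa';
""")

temp_ids = [row[0] for row in conn.execute(
    "SELECT collegiate_id FROM hasRemittance ORDER BY collegiate_id")]

# Anti-join: keep only collegiates with no matching remittance.
without_remittance = conn.execute("""
    SELECT c.collegiate_id
    FROM collegiates c
    LEFT JOIN hasRemittance h ON h.collegiate_id = c.collegiate_id
    WHERE h.collegiate_id IS NULL AND c.active = 1
""").fetchall()

print(temp_ids, without_remittance)
```

Without the DISTINCT, the insert would fail on the duplicate key for collegiate 1, which is the case the answer warns about.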
For "a few thousand" rows, you usually don't need to worry about performance.
Oracle has a zillion options for everything. MySQL has very few, with the default being (usually) the best. So ignore the answer that discussed various options that you could use in MySQL.
There are issues with
AND IF(notCollegiate,
c.collegiate_id NOT IN (notCollegiate),
'1=1')
I can't tell which table notCollegiate is in. notCollegiate cannot be a list, so why use IN? Instead simply use !=. Finally, '1=1' is a 3-character string; did you really want that?
For performance (of either version)
remittances needs INDEX(type_id, name, remittance_id) with remittance_id specifically last.
collegiateRemittances needs INDEX(remittance_id) (unless it is the PK).
collegiates needs INDEX(typePayment, active, exentFee , approvedBoard) in any order.
Bottom line: Worry more about indexes than how you formulate the query.
Ouch. Another wrinkle. What is getFee()? If it is a Stored Function, maybe we need to worry about optimizing it?? And what is dateS?
It depends, actually. You'll have to test the performance of every option. On my website I had 2 tables, one with articles and one with their comments. It turned out to be faster to fetch the comment count 20 times, once per article, than to use a single union query. MySQL before 8.0 (which removed the query cache) caches queries, like other DBs, so small simple queries can run amazingly fast.
I did not see that you had tagged the question as mysql, so I initially answered for Oracle. Here is what I think about MySQL.
MySQL
There are two options when it comes to temporary tables: memory or disk. For disk you can have MyISAM (non-transactional) or InnoDB (transactional). Of course, you can expect better performance from the non-transactional type of storage.
Additionally, you need to figure out how big a result set you are dealing with. For a small result set the memory option would be faster; for a large result set the disk option would be faster.
Again, as in my original answer, in the end you need to figure out what performance is good enough and go for the most descriptive and easy-to-read option.
Oracle
It depends on what kind of temporary table you are dealing with.
You can have session-based temporary tables (data is held until logout) or transaction-based ones (data is held until commit). On top of this, they can support transaction logging or not. Depending on configuration, you can get better performance from a temporary table.
Like everything in the world, performance is a relative term. Most probably, for a few thousand records there will be no significant difference between the two queries. In that case I would go not for the most performant one but for the one that is easiest to read and understand.
By simple logic I'd think yes, it is faster because the DBMS brings back less info and needs less memory. However, I don't have a solid argument for why it would be faster.
Say, for example, I want to select from 2 related tables, with indexes and everything.
I want to know why select tableA.field, tableA.field2, tableA.field3, tableB.field1, tableB.field2 from tableA, tableB
is actually faster than
select * from tableA,tableB
Both tables have about 3 million records and table A has about 14 fields and tableB got 18.
Any idea?
Thanks.
Reducing the number of fields selected means that less data has to be transmitted from the server to the client. It also reduces the amount of memory that the server and client have to use to hold the data selected. So these should improve performance once the server determines which rows should be in the result set.
It's not likely to have any significant impact on the speed of processing the query itself within the database server. That's dominated by the cost of joining the tables, filtering the rows based on the WHERE clause, and performing any calculations specified in the SELECT clause. These are all independent of the columns being selected. If you use EXPLAIN on the two queries, you won't see any difference.
You are joining two tables with 3 million rows each with no filter. That will make 9x10^12 rows. Generating and transmitting to the client a result set of a few fields, as opposed to all 32 fields, will make a difference.
If you select all fields in the first query it's the same thing because you request the same amount of data. Check this http://sqlfiddle.com/#!9/27987/2
Maybe the difference in performance has another cause, such as other selects running at the same time.
Essentially select * from tableA,tableB is the Cartesian product of the two tables, for a total of 3 million x 3 million rows.
Therefore:
select * from tableA,tableB
With the wildcard * you retrieve a table of 9 trillion rows x 32 columns, while
select tableA.field, tableA.field2, tableA.field3, tableB.field1, tableB.field2 from tableA, tableB
with the explicit form you have a table of 9 trillion rows x 5 columns, so less data!
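The arithmetic behind that comparison can be checked in a few lines. This sketch only uses the figures from the question (3 million rows per table, 14 and 18 columns) and counts cells rather than bytes, since the actual column widths are unknown.

```python
# Figures taken from the question; cell counts only, column widths are unknown.
rows_a = 3_000_000
rows_b = 3_000_000
cols_a, cols_b = 14, 18

# An unfiltered join of two tables is a Cartesian product.
cross_rows = rows_a * rows_b

all_columns = cols_a + cols_b        # what SELECT * returns per row
selected_columns = 5                 # what the explicit column list returns per row

all_cells = cross_rows * all_columns
selected_cells = cross_rows * selected_columns
ratio = all_cells / selected_cells

print(cross_rows, all_columns, ratio)
```

So the row count is the same in both cases, and SELECT * moves roughly 6.4 times as many values per row as the five-column version.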
To make the system more efficient, should we reduce the number of database IO or reduce the size of data operation?
More specifically, suppose I want to get top 60-70 objects.
1st approach:
Join several tables into one huge result, sort it on some attributes, and return the top 70 objects with all their attributes, of which I only use objects 60-70.
2nd approach:
Join fewer tables and sort them to get the ids of the top 70 objects, then do a second lookup for objects 60-70 based on their ids.
So which one is better in terms of efficiency, especially for MySQL?
It will depend on how you designed your query.
Usually JOIN operations are more efficient than using IN (group) or nested SELECTs, but when joining 3 or more tables you have to choose the join order carefully to optimize it.
And of course, every table join should involve a PRIMARY KEY.
If the query remains too slow despite your efforts, then you should use a cache: a new table, or even a file, that stores the results of this query until a given expiration time, when it should be updated.
This is a common practice when the results of a heavy query are needed frequently in the system.
You can always count on MySQL Workbench to measure the speed of your queries and play with your options.
Ordinarily, the best way to take advantage of query optimization is to combine the two approaches you present.
SELECT col, col, col, col, etc
FROM tab1
JOIN tab2 ON col = col
JOIN tab3 ON col = col
-- MySQL rejects LIMIT inside an IN (...) subquery, so join a derived table instead
JOIN
( SELECT distinct tab1.id
FROM whatever
JOIN whatever ON col = col
WHERE whatever
ORDER BY col DESC
LIMIT 70
) AS top_ids ON top_ids.id = tab1.id
See how that goes? You make a subquery to select the IDs, then use it in the main query.
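Here is a runnable sketch of that pattern: pick the top 70 ids in a narrow subquery, then join back for the wide columns. Python's sqlite3 stands in for MySQL, and the items and details tables, their columns, and the data are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; items/details and their contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, score INTEGER);
CREATE TABLE details (item_id INTEGER PRIMARY KEY, description TEXT);
CREATE INDEX idx_items_score ON items (score);
""")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, i * 10) for i in range(1, 101)])
conn.executemany("INSERT INTO details VALUES (?, ?)",
                 [(i, f"item {i}") for i in range(1, 101)])

# The derived table sorts and limits on the narrow table only;
# the wider join then touches just those 70 ids.
rows = conn.execute("""
    SELECT i.id, i.score, d.description
    FROM (SELECT id FROM items ORDER BY score DESC LIMIT 70) AS top_ids
    JOIN items i ON i.id = top_ids.id
    JOIN details d ON d.item_id = i.id
    ORDER BY i.score DESC
""").fetchall()

print(len(rows), rows[0], rows[-1])
```

The outer ORDER BY is repeated on purpose, because the order of rows coming out of the derived table is not guaranteed to survive the join.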
I have the following question. I have two tables and I want to store them separately. However, in my application I need to regularly perform UNION operation on them, so I effectively need to treat them as one. I think this is not good in terms of performance. I was thinking about creating the following view:
CREATE VIEW merged_tables(fields from both tables) AS
SELECT * FROM table_a
UNION
SELECT * FROM table_b
How does this impact the performance of a SELECT query? Does it have any real impact (is the inner representation different), or is it just a matter of using a simpler query to select?
Using a UNION inside a view will be no different in performance than using the UNION in a discrete SELECT statement. Since you are using SELECT * on both tables and require no inner complexity (e.g. no JOINS or WHERE clauses), there won't really be any way to further optimize it.
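A small sketch of the view next to the equivalent discrete UNION, using Python's sqlite3 as a stand-in for MySQL with made-up rows. The merged_tables_all view is an addition for comparison, not part of the question: UNION removes duplicate rows, while UNION ALL keeps them and skips the de-duplication step.

```python
import sqlite3

# SQLite stands in for MySQL; rows are made up. merged_tables_all is an
# extra view added here only to contrast UNION with UNION ALL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER, name TEXT);
CREATE TABLE table_b (id INTEGER, name TEXT);
INSERT INTO table_a VALUES (1, 'x'), (2, 'y');
INSERT INTO table_b VALUES (2, 'y'), (3, 'z');

CREATE VIEW merged_tables AS
    SELECT id, name FROM table_a
    UNION
    SELECT id, name FROM table_b;

CREATE VIEW merged_tables_all AS
    SELECT id, name FROM table_a
    UNION ALL
    SELECT id, name FROM table_b;
""")

via_view = conn.execute(
    "SELECT id, name FROM merged_tables ORDER BY id").fetchall()
direct = conn.execute("""
    SELECT id, name FROM table_a
    UNION
    SELECT id, name FROM table_b
    ORDER BY id
""").fetchall()
rows_with_duplicates = conn.execute(
    "SELECT COUNT(*) FROM merged_tables_all").fetchone()[0]

print(via_view == direct, rows_with_duplicates)
```

If the two tables can never contain the same row, UNION ALL gives the same result as UNION without the cost of removing duplicates.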
However, if the tables do hold similar data and you want to keep them logically separated, you might consider storing it all in one table with a boolean column that indicates whether a row would otherwise have been a resident of table_a or table_b. You gain a way to tell the rows apart and avoid the added confusion of the UNION then, and performance isn't significantly impacted either.
It is just a matter of using a simpler query to select. There will be no speed difference; however, the cost of querying the union should not be much worse (in most cases) than if you kept all the data in a single table.
If the tables are really the same structure, you could also consider another design in which you used a single table for storage and two views to logically separate the records:
CREATE VIEW table_a AS SELECT * FROM table_all WHERE rec_type = 'A';
CREATE VIEW table_b AS SELECT * FROM table_all WHERE rec_type = 'B';
Because these are single-table, non-aggregated VIEWs you can use them like tables in INSERT, UPDATE, DELETE, and SELECT but you have the advantage of also being able to program against table_all when it makes sense. The advantage here over your solution is that you can update against either the table_a, table_b entities or the table_all entity, and you don't have two physical tables to maintain.
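The single-table design could be sketched as follows. Python's sqlite3 stands in for MySQL, and the id and name columns, the index, and the rows are assumptions for the example. One caveat of the stand-in: SQLite views are read-only, so the sketch inserts through table_all and only reads through the views, whereas MySQL would also accept writes against these views.

```python
import sqlite3

# SQLite stands in for MySQL; columns other than rec_type are made up.
# SQLite views are read-only, so writes go to table_all in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_all (id INTEGER PRIMARY KEY, rec_type TEXT NOT NULL, name TEXT);
CREATE INDEX idx_rec_type ON table_all (rec_type);
CREATE VIEW table_a AS SELECT * FROM table_all WHERE rec_type = 'A';
CREATE VIEW table_b AS SELECT * FROM table_all WHERE rec_type = 'B';
INSERT INTO table_all VALUES (1, 'A', 'x'), (2, 'B', 'y'), (3, 'A', 'z');
""")

a_rows = conn.execute("SELECT id, name FROM table_a ORDER BY id").fetchall()
b_rows = conn.execute("SELECT id, name FROM table_b ORDER BY id").fetchall()
# The merged view of both "tables" is just the base table; no UNION needed.
all_count = conn.execute("SELECT COUNT(*) FROM table_all").fetchone()[0]

print(a_rows, b_rows, all_count)
```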