I am looking at a few queries for performance and made a change to a query, based on the following examples. The change turned a 6 minute query into one which completes in a few seconds, and I was wondering why. How has this altered things to such an extent?
In the examples, please assume the BOOK table contains the general details for all books in a library, and the FORMATS table contains format details such as HARDBACK, PAPERBACK and eBOOK (allowing for new formats to be added), with a key (called FORMATID) linking the two tables.
Query executes in 6 minutes
select b.bookid, f.formatname
from book b
inner join formats f on f.formatid = b.formatid
select b.bookid, f.formatname
from book b
left join formats f on f.formatid = b.formatid
Query executes in 12 seconds
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
where b.formatid is not null
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
In the above, the first query of each pair achieves INNER JOIN results and the second achieves LEFT JOIN results. The difference in results on my database is 295166 versus 295376 rows; the timing differences remain pretty much the same.
[added] For confirmation: I have tested this (with the same results) by creating the two test tables mentioned herein, populating the BOOK table with ~1 million rows and NOT applying any index or other optimisation.
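For reference, here is a minimal sketch of the two test tables as described (the column types are my assumptions; no keys or indexes are declared, matching the unoptimised test):

create table formats (
    formatid   integer,        -- the key linking the two tables
    formatname varchar(50)     -- e.g. HARDBACK, PAPERBACK, eBOOK
);

create table book (
    bookid   integer,
    formatid integer           -- nullable, hence the LEFT JOIN variant
);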
Let's say I have two tables, students and records, with the schema being
Students (id, name)
Records (rid,sid,subject,marks)
and I want to print (name, subject, marks).
So I can write the inner join in two ways
> select a.name,b.subject,b.marks from students a, records b where a.id = b.sid;
or
> select a.name,b.subject,b.marks from students a inner join records b on a.id = b.sid;
Obviously, they both return the same results and take the same amount of time. So I am not sure whether internally they both are the same, or if there is any scenario where either of them is preferable over the other?
Both are wrong. I assume this is a mistake, and the first where is supposed to be a from:
> select a.name,b.subject,b.marks from students a, records b where a.id = b.sid;
or
> select a.name,b.subject,b.marks from students a inner join records b on a.id = b.sid;
If we disregard this mistake and examine the queries above: these two statements are functionally equivalent, but implicit joins (the first form) have been deprecated for a long while. Hence, it's suggested to use explicit joins (the second form). An added bonus is the increased readability of the code: the join conditions are neatly arranged with the joins, and the where clause is left free to handle just the logic of the query.
They are the same and are executed the same way. Do an
EXPLAIN EXTENDED SELECT ...
on your queries and enable the warnings. Then MySQL will give you a warning containing the query as it looks after the optimizer has processed it. There should be the same warning for both queries.
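For example, a sketch against the students/records tables above (note that in later MySQL versions EXPLAIN EXTENDED is deprecated; a plain EXPLAIN followed by SHOW WARNINGS does the same job):

explain extended
select a.name, b.subject, b.marks
from students a
inner join records b on a.id = b.sid;

-- shows the rewritten query the optimizer actually executes
show warnings;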
I know that I can join 2-3 small tables easily by writing simple joins. However, these joins can become very slow when you have 7-8 tables with 20 million+ rows, joining on 1-3 columns,
even when you have the right indices. Moreover, the query becomes long and ugly too.
Is there an alternative strategy for doing such big joins, preferably database agnostic?
EDIT
Here is pseudocode for the join. Note that some tables may have to be unpivoted before they are used in the join -
select * from
(select c1,c2,c3... from t1 where ...) as s1
inner join
(select c1,... from t2 where ...) as s2
inner join
(unpivot table to get c1,c2... from t3 where ...) as s3
inner join
(select c1,c2,c3... from t2 where ...) as s4
on
(s1.c1 = s2.c1)
and
(s1.c1 = s3.c1 and s1.c2 = s3.c2)
and
(s1.c1 = s4.c1 and s2.c2 = s4.c2 and s1.c3 = s4.c3)
Clearly, this is complicated and ugly. Is there a way to get the same result set in a much neater way without using such a complex join?
"7-8 tables" doesn't sound worrying at all. Modern RDBMS can handle a lot more.
Your pseudo-code query can be radically simplified to this form:
SELECT a.c1 AS a_c1, a.c2 AS a_c2, ... -- use column aliases ...
,b.c1, b.c2, ... -- ... if you really have the same names more than once
,c.c1, c.c2, ...
,d.c1, d.c2, ...
FROM t1 a
JOIN t2 b USING (c1)
JOIN (unpivot table to get c1,c2... from t3 where ...) c USING (c1,c2)
JOIN t2 d ON d.c1 = a.c1 AND d.c2 = b.c2 AND d.c3 = a.c3
WHERE <some condition on a>
AND <more conditions> ..
As long as matching column names are unambiguous in the tables left of a JOIN, the USING syntax shortens the code. If anything can be ambiguous, use the explicit form demonstrated in my last join condition. That's all standard SQL, but according to this Wikipedia page:
The USING clause is not supported by MS SQL Server and Sybase.
It wouldn't make sense to use all those subqueries in your pseudo-code in most RDBMS. The query planner finds the best way to apply conditions and fetch columns itself. Smart query planners also rearrange tables in any order they see fit to arrive at a fast query plan.
Also, that thing called "database agnostic" only exists in theory. None of the major RDBMS completely implements the SQL standard and all of them have different weaknesses and strengths. You have to optimize for your RDBMS or get mediocre performance at best.
Indexing strategies are very important. 20 million rows doesn't matter much in a SELECT, as long as we can pluck the handful of row pointers we need from an index. Indexing strategies depend heavily on your brand of RDBMS. Columns that:
you JOIN on,
have WHERE conditions on them,
or are used in ORDER BY
may benefit from an index.
There are also various types of indexes, designed for various requirements: B-tree, GIN, GiST and more; partial, multicolumn, functional and covering indexes; various operator classes. To optimize performance you just need to know the basics and the capabilities of your RDBMS.
The excellent PostgreSQL manual chapter on indexes will give you an overview.
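As a sketch in PostgreSQL syntax (t1 and c1..c3 are taken from the pseudocode above), a single multicolumn index can cover several of the join and filter columns at once:

-- supports lookups on (c1), (c1, c2) and (c1, c2, c3),
-- matching the join conditions in the query above
CREATE INDEX t1_c1_c2_c3_idx ON t1 (c1, c2, c3);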
I have seen three ways of handling this if indexing fails to give a big enough performance boost.
The first is to use temp tables. The more joins the database performs, the worse the row estimates get, which can really slow down your query. If you first run the joins and where clauses that return the smallest number of rows, store the intermediate results in a temp table, and so give the cardinality estimator an accurate count, performance can improve significantly. This solution is the only one of the three that doesn't create permanent database objects.
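A minimal sketch of that approach in SQL Server syntax (the table and column names are hypothetical): materialise the most selective part first, then join the rest against the small intermediate result.

-- step 1: the most selective joins/filters, stored in a temp table
-- so the cardinality estimator sees an exact row count
SELECT c.customer_id, c.name
INTO #small_set
FROM customers c
JOIN regions r ON r.region_id = c.region_id
WHERE r.code = 'EU';

-- step 2: join the remaining tables against the small temp table
SELECT s.name, o.order_date, o.total
FROM #small_set s
JOIN orders o ON o.customer_id = s.customer_id;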
The second solution is a data warehouse, or at least an additional denormalized table (or tables). In this case you would create an additional table to hold the final results of the query, or several tables that perform the major joins and hold intermediate results. As an example, if you had a customers table and three other tables that hold information about a customer, you could create a new table that holds the result of joining those four tables. This solution generally works when you use the query for reports and can load the report table(s) each night with the new data generated during the day. This solution will be faster than the first, but is harder to implement and to keep the results current.
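For instance, a nightly load of such a report table might look like this (a sketch with hypothetical names; the point is simply to pay for the expensive joins once per day):

-- rebuild the denormalized report table during the nightly load
TRUNCATE TABLE customer_report;

INSERT INTO customer_report (customer_id, name, street, last_order_date)
SELECT c.customer_id, c.name, a.street, o.last_order_date
FROM customers c
JOIN addresses a ON a.customer_id = c.customer_id
JOIN order_summary o ON o.customer_id = c.customer_id;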
The third solution is a materialized view / indexed view. This solution depends heavily on the DB platform you use. Oracle and SQL Server both have a way to create a view and then index it, giving you greater performance on the view. This can come at the cost of not having current records, or greater storage cost for the view results, but it can help.
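In SQL Server, for example, an indexed view looks roughly like this (hypothetical names; note that indexed views require SCHEMABINDING, and COUNT_BIG plus non-nullable aggregate expressions when grouping):

CREATE VIEW dbo.v_customer_totals WITH SCHEMABINDING AS
SELECT customer_id,
    COUNT_BIG(*) AS order_count,
    SUM(ISNULL(total, 0)) AS total_sum
FROM dbo.orders
GROUP BY customer_id;
GO
-- the unique clustered index is what materialises the view
CREATE UNIQUE CLUSTERED INDEX ix_v_customer_totals
ON dbo.v_customer_totals (customer_id);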
Create materialized views and refresh them over night, or refresh them only when considered necessary. For example, you can have two views: one materialized, with old data that is never going to change, and a normal view with the current data, and then a union between these. You could have more views like these for any output you need.
If your database engine doesn't support materialized views, just denormalize the old data into another table over night.
Check this also: Refresh a Complex Materialized View
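A sketch of the union trick described above, in PostgreSQL syntax (the table names and the cutoff date are made up):

-- frozen history, refreshed only when necessary
CREATE MATERIALIZED VIEW sales_history AS
SELECT * FROM sales WHERE sale_date < DATE '2020-01-01';

-- live data stays in a normal view
CREATE VIEW sales_current AS
SELECT * FROM sales WHERE sale_date >= DATE '2020-01-01';

-- the combined view consumers query
CREATE VIEW sales_all AS
SELECT * FROM sales_history
UNION ALL
SELECT * FROM sales_current;

-- e.g. over the night:
REFRESH MATERIALIZED VIEW sales_history;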
I've been in the same situation before, and my strategy was to use the WITH clause (common table expressions).
See more here.
WITH
-- group some tables into a "temporary" view called MY_TABLE_A
MY_TABLE_A AS
(
SELECT T1.FIELD1, T2.FIELD2, T3.FIELD3
FROM T1
JOIN T2 ON T2.PKEY = T1.FKEY
JOIN T3 ON T3.PKEY = T2.FKEY
),
-- group some tables into another "temporary" view called MY_TABLE_B
MY_TABLE_B AS
(
SELECT T4.FIELD1, T5.FIELD2, T6.FIELD3
FROM T4
JOIN T5 ON T5.PKEY = T4.FKEY
JOIN T6 ON T6.PKEY = T5.FKEY
)
-- use those views
SELECT A.FIELD2, B.FIELD3
FROM MY_TABLE_A A
JOIN MY_TABLE_B B ON B.FIELD1 = A.FIELD1
WHERE A.FIELD3 = 'X'
AND B.FIELD2 = 'Y'
;
If you want to know whether there is another way to access the data: one approach would be to take an interest in the object concept. On Oracle, at any rate, it works very well and simplifies development.
But it requires a business-object approach.
From your example we can use two concepts:
References
Inheritance
which can ease the readability of a query and sometimes its speed.
1 : References
A reference is a pointer to an object. It allows the removal of joins between tables as they will be pointed.
Here is a simple example:
CREATE TYPE S7 AS OBJECT (
id NUMBER(11)
, code NUMBER(11)
, label2 VARCHAR2(1024 CHAR)
);
CREATE TABLE S7_tbl OF S7 (
CONSTRAINT s7_k PRIMARY KEY(id)
);
CREATE TABLE S8 (
info VARCHAR2(500 CHAR)
, info2 NUMBER(5)
, ref_s7 REF S7 -- creation of the reference
);
We insert some data into both tables:
INSERT INTO S7_tbl VALUES ( S7 (1,1111, 'test'));
INSERT INTO S7_tbl VALUES ( S7 (2,2222, 'test2'));
INSERT INTO S7_tbl VALUES ( S7 (3,3333, 'test3'));
--
INSERT INTO S8 VALUES ('text', 22, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text2', 44, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text3', 11, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 2222));
And the SELECT :
SELECT s8.info, s8.info2 FROM S8 s8 WHERE s8.ref_s7.code = 1111;
RETURN :
text2 | 44
text | 22
This is a kind of implicit join.
2 : Inheritance
CREATE TYPE S6 AS OBJECT (
name VARCHAR2(255 CHAR)
, date_start DATE
)
/
DROP TYPE S1;
CREATE TYPE S1 AS OBJECT(
data1 NUMBER(11)
, data2 VARCHAR(255 CHAR)
, data3 VARCHAR(255 CHAR)
) INSTANTIABLE NOT FINAL
/
CREATE TYPE S2 UNDER S1 (
dummy1 VARCHAR2(1024 CHAR)
, dummy2 NUMBER(11)
, dummy3 NUMBER(11)
, info_s6 S6
) INSTANTIABLE FINAL
/
CREATE TABLE S5
(
info1 VARCHAR2(128 CHAR)
, info2 NUMBER(6)
, object_s2 S2
)
We just insert a row into the table:
INSERT INTO S5
VALUES (
'info'
, 2
, S2(
1 -- fill data1
, 'xxx' -- fill data2
, 'yyy' -- fill data3
, 'zzz' -- fill dummy1
, 2 -- fill dummy2
, 4 -- fill dummy3
, S6(
'example1'
,SYSDATE
)
)
);
And the SELECT :
SELECT
s.info1
, s.object_s2.data1
, s.object_s2.dummy1
, s.object_s2.info_s6.name
FROM S5 s;
We can see that by this method we can easily access related data without writing any joins.
Hoping that this can serve you.
If it's all subqueries, you can do the filtering inside the subqueries for each table; since all the matching happens there, it can be as simple as the sketch below, as long as all the tables share c1, c2, c3. (Note that a derived table can only reference a previous one like this if your RDBMS supports lateral or correlated derived tables.)
select * from
(select c1,c2,c3... from t1) as s1
inner join
(select c1,... from t2 where c1 = s1.c1) as s2
inner join
(unpivot table to get c1,c2... from t3 where c2 = s2.c2) as s3
inner join
(select c1,c2,c3... from t2 where c3 = s3.c3) as s4
You can make use of views and functions. Views make SQL code elegant, easy to read, and easy to compose. Functions can return single values or rowsets, permitting fine-tuning of the underlying code for efficiency. Finally, filtering at subquery level instead of joining and then filtering at query level lets the engine produce smaller sets of data to join later, where indices are not that significant since the amount of data to join is small and can be computed efficiently on the fly. Something like the query below can include highly complex queries involving dozens of tables and complex business logic hidden in views and functions, and still be very efficient.
SELECT a.*, b.*
FROM (SELECT * FROM ComplexView
WHERE <filter that limits output to a few rows>) a
JOIN (SELECT x, y, z FROM AlreadySignificantlyFilteredView
WHERE x IN (SELECT f_XValuesForDate(CURRENT_DATE))) b
ON (a.x = b.x AND a.y = b.y AND a.z <= b.z)
WHERE <condition for filtering even further>
I have 3 tables, of which 2 tables have 200,000 records each and the other has 1,800,000 records. I merge these 3 tables using 2 constraints: OCN and a timestamp (month, year). The first two tables have a Monthx column (which includes month, date and year), while the other table has separate columns for month and year. I wrote the query as:
mysql> insert into trail
select * from A,B,C
where A.OCN=B.OCN
and B.OCN=C.OCN
and C.OCN=A.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(A.Monthx)=year(B.Monthx)
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
I started this query 4 days ago and it is still running. Could you tell me whether this query is correct or wrong, and provide me an exact query? (I used '%b' because my C table has a column which stores months in the form JAN, MAR.)
Please don't use implicit where joins; bury that syntax in 1989, where it belongs. Use explicit joins instead:
select * from a inner join b on (a.ocn = b.ocn and
date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b') ....
This select part of the query (I had to rewrite it because I refuse to deal with '89 syntax):
select * from A
inner join B on (
A.OCN=B.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and year(A.Monthx)=year(B.Monthx)
)
inner join C on (
C.OCN=A.OCN
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
)
It has a lot of problems:
using a function on a field will kill any opportunity to use an index on that field.
you are doing a lot of duplicate tests: if (A = B) and (B = C), then it logically follows that (A = C)
the translations of the date fields take a lot of time
I would suggest you rewrite your tables to use fields that don't need translating (using functions), but can be compared directly.
A field like yearmonth: char(6), e.g. 201006, can be indexed and compared much faster.
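In MySQL such a field could be added, backfilled and indexed like this (a sketch; the ym column and index name are assumptions):

ALTER TABLE A ADD COLUMN ym CHAR(6);
UPDATE A SET ym = DATE_FORMAT(Monthx, '%Y%m');  -- e.g. 201006
CREATE INDEX idx_a_ocn_ym ON A (OCN, ym);
-- likewise for B, and for C (building ym from its separate month and year columns)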
If tables A, B and C have such a field, called ym for short, then your query can be:
INSERT INTO TRAIL
SELECT a.*, b.*, c.* FROM a
INNER JOIN b ON (
a.ocn = b.ocn
AND a.ym = b.ym
)
INNER JOIN c ON (
a.ocn = c.ocn
AND a.ym = c.ym
);
If you put indexes on ocn (primary index probably) and ym the query should run about a million rows a second (or more).
To test if your query is OK, import a small subset of records from A, B and C into a temporary database and test it there.
You have redundancies in your implicit JOIN because you are joining A.OCN with B.OCN, B.OCN with C.OCN and then C.OCN with A.OCN; one of those can be deleted, since if A.OCN = B.OCN and B.OCN = C.OCN, then A.OCN = C.OCN is implied. Further, I guess you have redundancies in your date comparisons.
As the title says, this issue (joins in Access are not case sensitive) happens in MS Access 2003 SP1. Does anyone know what could be a solution for this problem?
Pseudo query
select * from a inner join b on a.id=b.id
For small data sets, there are any number of approaches using various possible string conversions.
But if your data sets are of any size at all, this will be very slow because it can't use the indexes.
You could possibly optimize by joining case insensitively and then using criteria to test whether the case is the same, e.g.:
SELECT *
FROM a INNER JOIN b ON a.id=b.id
WHERE Asc(a.id) <> Asc(b.id)
This would at least allow the use of an index join so you wouldn't be comparing "a" to "b" and "a" to "c" (as would be the case with joining on string functions), but only "a" to "a" and "a" to "A".
I would suggest that if your data really needs to distinguish case, then you probably need to store it in a database engine that can distinguish case in joins and then pass your SQL off from Access to that database engine (with a passthrough query, for example).
EDIT:
@apenwarr suggests using StrComp() in the JOIN (as did @butterchicken yesterday), and this SQL raises a question for me (I've updated her/his SQL to use the same table and field names I use above; it's essentially the same as @butterchicken's SQL):
SELECT *
FROM a INNER JOIN b
ON a.id = b.id
AND StrComp(a.id, b.id, 0) = 0
It is a fact that Jet will optimize a JOIN on an index exactly the same way it would optimize the equivalent WHERE clause (i.e., implicit JOIN). Stripped down to just the JOIN (presumably on indexed fields), these two SQL statements will be optimized identically by Jet:
SELECT *
FROM a INNER JOIN b
ON a.id = b.id
SELECT *
FROM a, b
WHERE a.id = b.id
My question is whether or not these three will optimize identically:
SELECT *
FROM a INNER JOIN b
ON a.id = b.id
AND StrComp(a.id, b.id, 0) = 0
SELECT *
FROM a INNER JOIN b
ON a.id = b.id
WHERE StrComp(a.id, b.id, 0) = 0
SELECT *
FROM a, b
WHERE a.id = b.id
AND StrComp(a.id, b.id, 0) = 0
I'm using SO to avoid work I'm supposed to do for tomorrow, so I don't have time to create a sample database and set up SHOWPLAN to test this, but the OP should definitely give it a try and report back on the results (assuming he/she definitely intends to do this with Jet).
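For anyone who wants to try: as far as I recall, Jet's SHOWPLAN is switched on via the registry rather than SQL (double-check the exact key for your Jet version); Jet then writes the plans to a SHOWPLAN.OUT file in the current directory.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Debug]
"JETSHOWPLAN"="ON"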
Have you tried StrComp? Not an Access-bod, but I think the query would look something like this:
SELECT * FROM MyTable A INNER JOIN MyTable B
on StrComp(A.description,B.description, 0) = 0
The 0 argument in StrComp causes a binary compare which will catch differences between "You" and "you" (for example).
I believe that SQL in Access is not case sensitive. Try this KB article for a possible solution.
Here's another possible solution. As mentioned in the posting, you are sacrificing speed...
Here's a variation of the earlier suggestion to use StrComp, but this one will start by narrowing down the join using indexes, then restrict it further using StrComp.
SELECT * FROM MyTable A INNER JOIN MyTable B
on A.description = B.description
AND StrComp(A.description,B.description, 0) = 0
This should be fast even on large data sets, as long as you don't have lots of entries that are distinguished only by case.