A strategy to join a large number of tables on multiple columns? - mysql

I know that I can join 2-3 small tables easily by writing simple joins. However, these joins can become very slow when you have 7-8 tables with 20 million+ rows, joining on 1-3 columns,
even when you have the right indices. Moreover, the query becomes long and ugly too.
Is there an alternative strategy for doing such big joins, preferably database agnostic?
EDIT
Here is pseudocode for the join. Note that some tables may have to be unpivoted before they are used in the join -
select * from
(select c1,c2,c3... From t1 where) as s1
inner join
(select c1,... From t2 where) as s2
inner join
(unpivot table to get c1,c2... From t3 where) as s3
inner join
(select c1,c2,c3... From t2 where) as s4
on
(s1.c1 = s2.c1)
and
(s1.c1 = s3.c1 and s1.c2 = s3.c2)
and
(s1.c1 = s4.c1 and s2.c2 = s4.c2 and s1.c3 = s4.c3)
Clearly, this is complicated and ugly. Is there a way to get the same result set in a much neater way without using such a complex join?

"7-8 tables" doesn't sound worrying at all. Modern RDBMS can handle a lot more.
Your pseudo-code query can be radically simplified to this form:
SELECT a.c1 AS a_c1, a.c2 AS a_c2, ... -- use column aliases ...
,b.c1, b.c2, ... -- .. If you really have same names more than once
,c.c1, c.c2, ...
,d.c1, d.c2, ...
FROM t1 a
JOIN t2 b USING (c1)
JOIN (unpivot table to get c1,c2... From t3 where) c USING (c1,c2)
JOIN t2 d ON d.c1 = a.c1 AND d.c2 = b.c2 AND d.c3 = a.c3
WHERE <some condition on a>
AND <more conditions> ..
As long as matching column names are unambiguous in the tables left of a JOIN, the USING syntax shortens the code. If anything can be ambiguous, use the explicit form demonstrated in my last join condition. That's all standard SQL, but according to this Wikipedia page:
The USING clause is not supported by MS SQL Server and Sybase.
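To see that the USING form and the explicit ON form return the same rows, here is a minimal sketch using Python's built-in sqlite3 module. The tables t1 and t2 and their contents are invented for illustration, and SQLite merely stands in for whichever RDBMS you actually run.

```python
import sqlite3

# In-memory SQLite database; t1/t2 and their rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 TEXT);
    CREATE TABLE t2 (c1 INTEGER, c3 TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1, 'x'), (3, 'z'), (4, 'w');
""")

# Short form: USING names the shared join column once.
rows_using = conn.execute(
    "SELECT a.c1, a.c2, b.c3 FROM t1 a JOIN t2 b USING (c1) ORDER BY a.c1"
).fetchall()

# Explicit form: the fallback when a column name could be ambiguous.
rows_on = conn.execute(
    "SELECT a.c1, a.c2, b.c3 FROM t1 a JOIN t2 b ON b.c1 = a.c1 ORDER BY a.c1"
).fetchall()

print(rows_using)
print(rows_on)
```

Both queries should print the same two rows, so the choice between them is about readability and portability, not results.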
It wouldn't make sense to use all those subqueries in your pseudo-code in most RDBMS. The query planner finds the best way to apply conditions and fetch columns itself. Smart query planners also rearrange tables in any order they see fit to arrive at a fast query plan.
Also, that thing called "database agnostic" only exists in theory. None of the major RDBMS completely implements the SQL standard and all of them have different weaknesses and strengths. You have to optimize for your RDBMS or get mediocre performance at best.
Indexing strategies are very important. 20 million rows doesn't matter much in a SELECT, as long as we can pluck a handful of row pointers from an index. Indexing strategies heavily depend on your brand of RDBMS. Columns:
You JOIN on,
Have WHERE conditions on,
Or are used in ORDER BY
May benefit from an index.
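A quick way to check whether an index actually helps a given query is to compare the query plan before and after creating it. The sketch below does that with SQLite's EXPLAIN QUERY PLAN (other systems have their own EXPLAIN variants); the table t1, the column payload, and the index name idx_t1_c1 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(i % 1000, f"row {i}") for i in range(10_000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite-specific; the last column of each row is the readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT payload FROM t1 WHERE c1 = 42"

plan_before = plan(query)                       # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_t1_c1 ON t1 (c1)")
plan_after = plan(query)                        # the lookup now goes through idx_t1_c1

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so only the presence of the index name is worth relying on.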
There are also various types of indexes, designed for various requirements: B-tree, GIN, GiST. Partial, multicolumn, functional, covering. Various operator classes. To optimize performance you just need to know the basics and the capabilities of your RDBMS.
The excellent PostgreSQL manual has a chapter on indexes that gives you an overview.

I have seen three ways of handling this if indexing fails to give a big enough performance boost.
The first is to use temp tables. The more joins the database performs, the worse its row estimates get, which can really slow down your query. If you first run the joins and WHERE clauses that return the smallest number of rows, and store the intermediate results in a temp table, the cardinality estimator gets a more accurate count and performance can improve significantly. This solution is the only one that doesn't create new database objects.
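The mechanics of the temp table approach can be sketched as follows, again with sqlite3 and invented tables t1, t2, t3. It only shows that staging the most selective join first gives the same rows as one big join; whether it is faster depends entirely on your RDBMS and data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 TEXT);
    CREATE TABLE t2 (c1 INTEGER, c3 TEXT);
    CREATE TABLE t3 (c1 INTEGER, c4 TEXT);
    INSERT INTO t1 VALUES (1, 'keep'), (2, 'drop'), (3, 'keep');
    INSERT INTO t2 VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO t3 VALUES (1, 'p'), (3, 'q'), (3, 'r');
""")

# Step 1: run the most selective join and filter first and store the small result.
conn.execute("""
    CREATE TEMP TABLE stage1 AS
    SELECT a.c1, a.c2, b.c3
    FROM t1 a JOIN t2 b ON b.c1 = a.c1
    WHERE a.c2 = 'keep'
""")

# Step 2: join the remaining table against the staged rows.
staged = conn.execute("""
    SELECT s.c1, s.c3, c.c4
    FROM stage1 s JOIN t3 c ON c.c1 = s.c1
    ORDER BY s.c1, c.c4
""").fetchall()

# The single three-way join, for comparison only.
direct = conn.execute("""
    SELECT a.c1, b.c3, c.c4
    FROM t1 a JOIN t2 b ON b.c1 = a.c1 JOIN t3 c ON c.c1 = a.c1
    WHERE a.c2 = 'keep'
    ORDER BY a.c1, c.c4
""").fetchall()

print(staged)
```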
The second solution is a data warehouse, or at least one or more additional denormalized tables. In this case you would create an additional table to hold the final results of the query, or several tables that perform the major joins and hold intermediate results. As an example, if you had a customers table and three other tables that hold information about a customer, you could create a new table that holds the result of joining those four tables. This solution generally works when you are using this query for reports and you can load the report table(s) each night with the new data generated during the day. This solution will be faster than the first, but it is harder to implement and it is harder to keep the results current.
The third solution is a materialized view / indexed view. This solution depends heavily on the database platform you use. Oracle and SQL Server both have a way to create a view and then index it, giving you greater performance on the view. This can come at the cost of not having current records, or of extra storage for the view results, but it can help.

Create materialized views and refresh them overnight, or refresh them only when you consider it necessary. For example, you can have 2 views: one materialized view with old data that is never going to change, and another normal view with the current data. Then take a union of the two. You can build more views like these for any output you need.
If your database engine doesn't support materialized views, just denormalize the old data in another table over the night.
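Here is a rough sketch of that "frozen snapshot plus live view" idea, using the fallback of a plain denormalized table because SQLite has no materialized views. The orders table, the cutoff date and the names totals_old, totals_new and totals are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, day TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        (1, 'ann', '2024-01-01', 10),
        (2, 'bob', '2024-01-01', 20),
        (3, 'ann', '2024-01-02', 5);

    -- "Materialized" part: a plain table filled by a nightly job (old, frozen data).
    CREATE TABLE totals_old AS
        SELECT customer, day, SUM(amount) AS total
        FROM orders WHERE day < '2024-01-02'
        GROUP BY customer, day;

    -- Live part: an ordinary view over the data that can still change.
    CREATE VIEW totals_new AS
        SELECT customer, day, SUM(amount) AS total
        FROM orders WHERE day >= '2024-01-02'
        GROUP BY customer, day;

    -- Readers query one view that unions both parts.
    CREATE VIEW totals AS
        SELECT * FROM totals_old UNION ALL SELECT * FROM totals_new;
""")

# A new order for the current day shows up without refreshing the snapshot.
conn.execute("INSERT INTO orders VALUES (4, 'ann', '2024-01-02', 7)")
totals = conn.execute("SELECT * FROM totals ORDER BY day, customer").fetchall()
print(totals)
```

The nightly job would then only need to move the cutoff forward and append the newly frozen day to totals_old.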
Check this also: Refresh a Complex Materialized View

I've been in the same situation before, and my strategy was to use the WITH clause.
See more here.
WITH
-- group some tables into a "temporary" view called MY_TABLE_A
MY_TABLE_A AS
(
SELECT T1.FIELD1, T2.FIELD2, T3.FIELD3
FROM T1
JOIN T2 ON T2.PKEY = T1.FKEY
JOIN T3 ON T3.PKEY = T2.FKEY
),
-- group some tables into another "temporary" view called MY_TABLE_B
MY_TABLE_B AS
(
SELECT T4.FIELD1, T5.FIELD2, T6.FIELD3
FROM T4
JOIN T5 ON T5.PKEY = T4.FKEY
JOIN T6 ON T6.PKEY = T5.FKEY
)
-- use those views
SELECT A.FIELD2, B.FIELD3
FROM MY_TABLE_A A
JOIN MY_TABLE_B B ON B.FIELD1 = A.FIELD1
WHERE A.FIELD3 = 'X'
AND B.FIELD2 = 'Y'
;

If you want to know whether there is another way to access the data, one approach is to look at object types. On Oracle, at any rate, this works very well and simplifies development.
But it requires a business object approach.
From your example we can use two concepts:
References
Inheritance
These can improve the readability of a query and sometimes its speed.
1 : References
A reference is a pointer to an object. It removes the need for joins between tables, since the related rows are pointed to directly.
Here is a simple example:
CREATE TYPE S7 AS OBJECT (
id NUMBER(11)
, code NUMBER(11)
, label2 VARCHAR2(1024 CHAR)
);
CREATE TABLE S7_tbl OF S7 (
CONSTRAINT s7_k PRIMARY KEY(id)
);
CREATE TABLE S8 (
info VARCHAR2(500 CHAR)
, info2 NUMBER(5)
, ref_s7 REF S7 -- creation of the reference
);
We insert some data into both tables:
INSERT INTO S7_tbl VALUES ( S7 (1,1111, 'test'));
INSERT INTO S7_tbl VALUES ( S7 (2,2222, 'test2'));
INSERT INTO S7_tbl VALUES ( S7 (3,3333, 'test3'));
--
INSERT INTO S8 VALUES ('text', 22, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text2', 44, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text3', 11, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 2222));
And the SELECT :
SELECT s8.info, s8.info2 FROM S8 s8 WHERE s8.ref_s7.code = 1111;
Result:
text2 | 44
text | 22
This is a kind of implicit join.
2 : Inheritance
CREATE TYPE S6 AS OBJECT (
name VARCHAR2(255 CHAR)
, date_start DATE
)
/
DROP TYPE S1;
CREATE TYPE S1 AS OBJECT(
data1 NUMBER(11)
, data2 VARCHAR(255 CHAR)
, data3 VARCHAR(255 CHAR)
) INSTANTIABLE NOT FINAL
/
CREATE TYPE S2 UNDER S1 (
dummy1 VARCHAR2(1024 CHAR)
, dummy2 NUMBER(11)
, dummy3 NUMBER(11)
, info_s6 S6
) INSTANTIABLE FINAL
/
CREATE TABLE S5
(
info1 VARCHAR2(128 CHAR)
, info2 NUMBER(6)
, object_s2 S2
);
We just insert a row into the table:
INSERT INTO S5
VALUES (
'info'
, 2
, S2(
1 -- fill data1
, 'xxx' -- fill data2
, 'yyy' -- fill data3
, 'zzz' -- fill dummy1
, 2 -- fill dummy2
, 4 -- fill dummy3
, S6(
'example1'
,SYSDATE
)
)
);
And the SELECT :
SELECT
s.info1
, s.object_s2.data1
, s.object_s2.dummy1
, s.object_s2.info_s6.name
FROM S5 s;
We can see that with this method we can easily access related data without using joins.
I hope this helps.

If it's all subqueries, you can do the matching inside each subquery. As long as all the tables have c1, c2, c3, it should be as simple as the following:
select * from
(select c1,c2,c3... from t1) as s1
inner join
(select c1,... from t2 where c1 = s1.c1) as s2
inner join
(unpivot table to get c1,c2... from t3 where c2 = s2.c2) as s3
inner join
(select c1,c2,c3... from t2 where c3 = s3.c3) as s4

You can make use of views and functions. Views make SQL code elegant and easy to read and compose. Functions can return single values or rowsets, which permits fine-tuning the underlying code for efficiency. Finally, filtering at the subquery level, instead of joining and then filtering at the query level, permits the engine to produce smaller sets of data to join later. At that point indices are not that significant, since the amount of data to join is small and can be computed efficiently on the fly. Something like the query below can include highly complex queries involving dozens of tables and complex business logic hidden in views and functions, and still be very efficient.
SELECT a.*, b.*
FROM (SELECT * FROM ComplexView
WHERE <filter that limits output to a few rows>) a
JOIN (SELECT x, y, z FROM AlreadySignificantlyFilteredView
WHERE x IN (SELECT f_XValuesForDate(CURRENT_DATE))) b
ON (a.x = b.x AND a.y = b.y AND a.z <= b.z)
WHERE <condition for filtering even further>

Related

Alternative to NOT IN?

I need to check two tables and find inconsistencies, i.e. rows where the value in table T1 is not present in the italy_cities table. I'll explain:
T1: Includes personal data (with place of birth)
italy_cities: Includes all the municipalities of Italy.
Table T1 has about 9000 tuples.
italy_cities has 7,903 tuples.
Using "NOT IN" the query takes approximately 16 seconds to execute.
Here is the query:
SELECT
`T1`.*
FROM
T1
WHERE
(
`T1`.place NOT IN ( SELECT municipality FROM italy_cities )
)
MY QUESTION
What is the best and fastest option to check for inconsistencies, i.e. to find all the "incorrect" municipalities that do not exist in the official database?
Thanks in advance
I generally recommend NOT EXISTS for this purpose:
SELECT T1.*
FROM T1
WHERE NOT EXISTS (SELECT 1
FROM italy_cities ic
WHERE T1.place = ic.municipality
);
Why? There are two reasons:
NOT IN does not do what you expect if the subquery returns any NULL values. If even one value is NULL all rows end up being filtered out.
This version of the query can take advantage of an index on italy_cities(municipality) which seems like a reasonable index on the table.
NOT EXISTS can perform better, but there is also another way, a LEFT JOIN, as follows:
SELECT T1.*
FROM T1
LEFT JOIN italy_cities I ON I.municipality = T1.PLACE
WHERE I.municipality IS NULL;
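The NULL pitfall of NOT IN, and the fact that NOT EXISTS and the LEFT JOIN form are not affected by it, can be checked with a small sketch like this one. It uses sqlite3 with a handful of made-up rows; the name column in T1 is an assumption added for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (name TEXT, place TEXT);
    CREATE TABLE italy_cities (municipality TEXT);
    INSERT INTO T1 VALUES ('Anna', 'Roma'), ('Luca', 'Atlantis');
    -- One NULL municipality is enough to trigger the NOT IN pitfall.
    INSERT INTO italy_cities VALUES ('Roma'), ('Milano'), (NULL);
""")

# NOT IN: the NULL makes the comparison unknown for every non-matching row.
not_in = conn.execute(
    "SELECT name FROM T1 WHERE place NOT IN (SELECT municipality FROM italy_cities)"
).fetchall()

# NOT EXISTS: only asks whether a matching row exists, so NULLs do no harm.
not_exists = conn.execute("""
    SELECT name FROM T1
    WHERE NOT EXISTS (SELECT 1 FROM italy_cities ic WHERE ic.municipality = T1.place)
""").fetchall()

# LEFT JOIN ... IS NULL: keeps the rows of T1 that found no partner.
left_join = conn.execute("""
    SELECT T1.name FROM T1
    LEFT JOIN italy_cities i ON i.municipality = T1.place
    WHERE i.municipality IS NULL
""").fetchall()

print(not_in, not_exists, left_join)
```

With the NULL present, the NOT IN query returns nothing at all, while the other two report the one person whose place is missing from italy_cities.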

Optimising a SQL query with a huge where clause

I am working on a system (with Laravel) where users can fill a few filters to get the data they need.
Data is not prepared real time, once the filters are set, a job is pushed to the queue and once the query finishes a CSV file is created. Then the user receives an email with the file which was created so that they can download it.
I have seen some errors in the jobs where it took longer than 30 minutes to process one job, and when I checked I saw that some users had created filters with more than 600 values.
These filter values are translated like this:
SELECT filed1,
field2,
field6
FROM table
INNER JOIN table2
ON table.id = table2.cid
/* this is how we try not to give same data to the users again so we used NOT IN */
WHERE table.id NOT IN(SELECT data_id
FROM data_access
WHERE data_user = 26)
AND ( /* this bit is auto populated with the filter values */
table2.filed_a = 'text a'
OR table2.filed_a = 'text b'
OR table2.filed_a = 'text c' )
Well, I was not expecting users to go wild and fine-tune with a huge filter set. It is okay for them to do this, but I need a solution to make this query quicker.
One way is to create a temp table on the fly with the filter values and convert the query to an INNER JOIN, but I'm not sure whether it would improve performance.
Also, on a normal day the system would need to create at least 40-ish temp tables and delete them afterwards. Would this become another issue in the long run?
I would love to hear any other suggestions, other than the temp table method, that may help me solve this issue.
I would suggest writing the query like this:
SELECT ?.filed1, ?.field2, ?.field6 -- qualify column names (but no effect on performance)
FROM table t JOIN
table2 t2
ON t.id = t2.cid
WHERE NOT EXISTS (SELECT 1
FROM data_access da
WHERE t.id = da.data_id AND da.data_user = 26
) AND
t2.filed_a IN ('text a', 'text b', 'text c') ;
Then I would recommend indexes. Most likely:
table2(filed_a, cid)
table(id) (may not be necessary if id is already the primary key)
data_access(data_id, data_user)
You can test this as your own query. I don't know how to get Laravel to produce this (assuming it meets your performance objectives).
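For comparing the two options discussed here (an IN list versus a temp table of filter values), a small harness like the following may help. It is only a sketch with sqlite3: the tables items and details stand in for table and table2, and field_a, filter_values and the index names are assumptions. It shows that both variants select the same rows; the timing has to be measured on your own MySQL data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, field1 TEXT);
    CREATE TABLE details (cid INTEGER, field_a TEXT);
    CREATE TABLE data_access (data_id INTEGER, data_user INTEGER);
    CREATE INDEX idx_details ON details (field_a, cid);
    CREATE INDEX idx_access ON data_access (data_id, data_user);
    INSERT INTO items VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO details VALUES (1, 'text a'), (2, 'text b'), (3, 'text z');
    INSERT INTO data_access VALUES (2, 26);  -- user 26 already received item 2
""")

filters = ["text a", "text b", "text c"]  # could be 600+ values in practice

# Variant 1: NOT EXISTS plus an IN list built from bound parameters.
placeholders = ", ".join("?" for _ in filters)
in_rows = conn.execute(f"""
    SELECT t.id, t.field1 FROM items t
    JOIN details t2 ON t2.cid = t.id
    WHERE NOT EXISTS (SELECT 1 FROM data_access da
                      WHERE da.data_id = t.id AND da.data_user = ?)
      AND t2.field_a IN ({placeholders})
    ORDER BY t.id
""", [26, *filters]).fetchall()

# Variant 2: load the filter values into a temp table and join against it.
conn.execute("CREATE TEMP TABLE filter_values (val TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO filter_values VALUES (?)", [(v,) for v in filters])
join_rows = conn.execute("""
    SELECT t.id, t.field1 FROM items t
    JOIN details t2 ON t2.cid = t.id
    JOIN filter_values f ON f.val = t2.field_a
    WHERE NOT EXISTS (SELECT 1 FROM data_access da
                      WHERE da.data_id = t.id AND da.data_user = ?)
    ORDER BY t.id
""", [26]).fetchall()

print(in_rows)
print(join_rows)
```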

MySQL JOIN tables with WHERE clause

I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but i'll admit, I am no mysql veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that MySQL versions before 5.7.7 don't allow subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you, since you are relatively new to SQL. Use "aliases" for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It makes your life easier and helps anyone AFTER you know which columns come from which table, especially when the same column name exists in different tables. It prevents ambiguity in the query. Your left join, I think, from the sample, may be ambiguous; can you confirm the join of B.ID to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry could carry the "Team_ID" that the posting came from, in addition to its OWN "ID" as the blabbing table's unique key.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, use the SQL keyword "OR" rather than the logical "||" operator.
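Building out the IN list from an array can be sketched like this. The question uses PHP, so treat this Python/sqlite3 version as an analogy rather than drop-in code: team_ids plays the role of $team_array, member_id the role of $id, and the sample rows are invented. The list is expanded into one placeholder per element, so the values are bound instead of being pasted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blabbing (id INTEGER PRIMARY KEY, mem_id INTEGER, fromid INTEGER, the_blab TEXT);
    CREATE TABLE team_blabbing (id INTEGER, team_id INTEGER);
    INSERT INTO blabbing VALUES (1, 7, 0, 'mine'), (2, 8, 0, 'team post'), (3, 9, 0, 'other');
    INSERT INTO team_blabbing VALUES (2, 4), (3, 99);
""")

team_ids = [1, 4, 18, 23, 58]   # plays the role of $team_array
member_id = 7                   # plays the role of $id

# One "?" per list element, giving "?, ?, ?, ?, ?" for five ids.
placeholders = ", ".join("?" * len(team_ids))
sql = f"""
    SELECT B.id, B.the_blab
    FROM blabbing B
    LEFT JOIN team_blabbing TB ON B.id = TB.id
    WHERE TB.team_id IN ({placeholders})
       OR B.mem_id = ?
    ORDER BY B.id
"""
rows = conn.execute(sql, [*team_ids, member_id]).fetchall()
print(rows)
```

An empty list needs separate handling, since an empty IN () is not valid in every database.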
EDIT -- per your comment
This could be done in a variety of ways, such as dynamic SQL building and executing, calling multiple times, once for each ID and merging the results, or additionally, by doing a join to yet another temp table that gets cleaned out say... daily.
If you have another table such as "TeamJoins", and it has say... 3 columns: a date, a sessionid and team_id, you could daily purge anything from a day old of queries, and/or keep clearing each time a new query by the same session ID (as it appears coming from PHP). Have two indexes, one on the date (to simplify any daily purging), and second on (sessionID, team_id) for the join.
Then, loop through to do inserts into the "TeamJoins" table with the simple elements identified.
THEN, instead of a hard-coded list IN, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TJ.Team_ID IS NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is;
I added an extra column to my blabbing table called team_id and set it to null as well as another field in my team_blabbing table called mem_id
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.

merging tables which consist of 17 million records

I have 3 tables: 2 of them have 200 000 records each and the third has 1 800 000 records. I merge these 3 tables using 2 constraints, OCN and TIMESTAMP (month, year). The first two tables store month and year in a single column, Monthx (which includes month, date and year), and the other table has separate columns for month and year. I wrote the query as:
mysql--> insert into trail
select * from A,B,C
where A.OCN=B.OCN
and B.OCN=C.OCN
and C.OCN=A.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(A.Monthx)=year(B.Monthx)
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
I started this query 4 days ago and it's still running. Could you tell me whether this query is correct or wrong, and provide me with an exact query? (I used '%b' because my C table has a column which holds months in the form JAN, MAR.)
Please don't use implicit WHERE joins; bury them in 1989, where they belong. Use explicit joins instead:
select * from a inner join b on (a.ocn = b.ocn and
date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b') ....
This select part of the query (had to rewrite it because I refuse to deal with '89 syntax)
select * from A
inner join B on (
A.OCN=B.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and year(A.Monthx)=year(B.Monthx)
)
inner join C on (
C.OCN=A.OCN
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
)
Has a lot of problems.
Using a function on a field will kill any opportunity to use an index on that field.
You are doing a lot of duplicate tests: if (A = B) and (B = C), then it logically follows that (A = C).
The translations of the date fields take a lot of time.
I would suggest you rewrite your tables to use fields that don't need translating (using functions), but can be compared directly.
A field like yearmonth : char(6) e.g. 201006 can be indexed and compared much faster.
If the tables A, B, C have a field called ym for short, then your query can be:
INSERT INTO TRAIL
SELECT a.*, b.*, c.* FROM a
INNER JOIN b ON (
a.ocn = b.ocn
AND a.ym = b.ym
)
INNER JOIN c ON (
a.ocn = c.ocn
AND a.ym = c.ym
);
If you put indexes on ocn (primary index probably) and ym the query should run about a million rows a second (or more).
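The effect of wrapping a column in a function can be made visible with a query plan. The sketch below uses SQLite (so strftime stands in for MySQL's date_format/year, and EXPLAIN QUERY PLAN for MySQL's EXPLAIN); the payload column and the index names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (ocn INTEGER, monthx TEXT, ym TEXT, payload TEXT);
    CREATE INDEX idx_a_monthx ON a (monthx);
    CREATE INDEX idx_a_ym ON a (ym);
    INSERT INTO a VALUES (1, '2010-06-15', '201006', 'x'), (2, '2010-07-01', '201007', 'y');
""")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite-specific; the last column holds the readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function around the column: the plain index on monthx is not used for the lookup.
plan_function = plan("SELECT payload FROM a WHERE strftime('%Y%m', monthx) = '201006'")

# Direct comparison on the precomputed ym column: the index on ym is usable.
plan_direct = plan("SELECT payload FROM a WHERE ym = '201006'")

print(plan_function)
print(plan_direct)
```

The first plan should report a scan of the whole table and the second a search through idx_a_ym, which is the difference the rewrite to a ym column is meant to buy.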
To test if your query is ok, import a small subset of records from A, B and C to a temporary database and test it there.
You have redundancies in your implicit JOIN because you are joining A.OCN with B.OCN, B.OCN with C.OCN and then C.OCN to A.OCN; one of those can be deleted. If A.OCN = B.OCN and B.OCN = C.OCN, then A.OCN = C.OCN is implied. Further, I guess you have redundancies in your date comparisons.

MySQL versus SQL Server Express Performance Comparison

I have a somewhat complex query with roughly 100K rows.
The query runs in 13 seconds in SQL Server Express (run on my dev box)
The same query with the same indexing and tables takes over 15 minutes to run on MySQL 5.1 (run on my production box - much more powerful and tested with 100% resources). Sometimes the query even crashes the machine with an out of memory error.
What am I doing wrong in MySQL? Why does it take so long?
select e8.*
from table_a e8
inner join (
select max(e6.id) as id, e6.category, e6.entity, e6.service_date
from (
select e4.*
from table_a e4
inner join (
select max(e2.id) as id, e3.rank, e2.entity, e2.provider_id, e2.service_date
from table_a e2
inner join (
select min(e1.rank) as rank, e1.entity, e1.provider_id, e1.service_date
from table_a e1
where e1.site_id is not null
group by e1.entity, e1.provider_id, e1.service_date
) as e3
on e2.rank= e3.rank
and e2.entity = e3.entity
and e2.provider_id = e3.provider_id
and e2.service_date = e3.service_date
and e2.rank= e3.rank
group by e2.entity, e2.provider_id, e2.service_date, e3.rank
) e5
on e4.id = e5.id
and e4.rank= e5.rank
) e6
group by e6.category, e6.entity, e6.service_date
) e7
on e8.id = e7.id and e7.category = e8.category
I originally attempted to post this answer to your deleted question, which did not indicate that it was a problem with MySQL. I would still go ahead and use SQL Server to refactor the query using CTEs and then convert back to nested queries (if any remain). Sorry about the formatting, Jeff Atwood sent me the original posted text and I had to reformat it again.
It's hard to do without data, expected results and good names, but I would convert all the nested queries into CTEs, stack them up, name them meaningfully and refactor - starting with excluding the columns which you aren't using. Removing the columns is not going to result in the improvement, because the optimizer is pretty smart - but it WILL give you the ability to improve your query - probably factoring out some or all of the CTEs. I'm not sure what your code is doing, but you may find the new RANK()-type functions useful, because it appears you are using a seek-back type of pattern with all these self-joins.
So start from here instead. I've looked at e7 improvements for you, the columns unused from e7 may indicate either a defect or incomplete thinking about the grouping possibilities, but if those columns are truly unnecessary, then this may trickle all the way back through your logic in e6, e5 and e3. If the grouping in e7 is correct then you can eliminate everything but max(id) in the results and the join. I cannot see why you would have multiple MAX(id) per category, because this would multiply your results when you join, so the MAX(id) must be unique within the category, in which case the category is redundant in the join.
WITH e3 AS (
select min(e1.rank) as rank,
e1.entity,
e1.provider_id,
e1.service_date
from table_a e1
where e1.site_id is not null
group by e1.entity, e1.provider_id, e1.service_date
)
,e5 AS (
select max(e2.id) as id,
e3.rank,
e2.entity,
e2.provider_id,
e2.service_date
from table_a e2
inner join e3
on e2.rank= e3.rank
and e2.entity = e3.entity
and e2.provider_id = e3.provider_id
and e2.service_date = e3.service_date
and e2.rank= e3.rank
group by e2.entity, e2.provider_id, e2.service_date, e3.rank
)
,e6 AS (
select e4.* -- switch from * to only the columns you are actually using
from table_a e4
inner join e5
on e4.id = e5.id
and e4.rank= e5.rank
)
,e7 AS (
select max(e6.id) as id, e6.category -- unused, e6.entity, e6.service_date
from e6
group by e6.category, e6.entity, e6.service_date
-- This instead
-- select max(e6.id) as id
-- from e6
-- group by e6.category, e6.entity, e6.service_date
)
select e8.*
from table_a e8
inner join e7
on e8.id = e7.id
and e7.category = e8.category
-- THIS INSTEAD on e8.id = e7.id
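As for the RANK()-type functions mentioned above, here is a rough sketch of how a window function can replace the "min rank, then max id per group" self-joins. It is an assumption about what the query intends, not a verified rewrite: it ignores the site_id filter and the later category grouping, the column is called rnk instead of rank to avoid keyword clashes, and it needs an engine with window functions (SQLite 3.25 or later in this example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, entity TEXT, provider_id INTEGER,
                          service_date TEXT, rnk INTEGER);
    INSERT INTO table_a VALUES
        (1, 'e1', 10, '2024-01-01', 2),
        (2, 'e1', 10, '2024-01-01', 1),
        (3, 'e1', 10, '2024-01-01', 1),
        (4, 'e2', 20, '2024-01-01', 5);
""")

# Per (entity, provider_id, service_date): lowest rnk first, ties broken by highest id.
rows = conn.execute("""
    WITH ordered AS (
        SELECT id, entity, rnk,
               ROW_NUMBER() OVER (
                   PARTITION BY entity, provider_id, service_date
                   ORDER BY rnk ASC, id DESC
               ) AS pos
        FROM table_a
    )
    SELECT id, entity, rnk FROM ordered WHERE pos = 1 ORDER BY id
""").fetchall()
print(rows)
```

One pass over table_a replaces the e3/e5/e6 chain in this simplified form; the real query would still need its extra conditions added back.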
100,000 rows shouldn't take 13 seconds if efficient indexes were available. I suspect the difference is due to the fact that SQL Server has a much more robust query optimizer than MySQL. What MySQL has is closer to an SQL parser than an optimizer.
You'll need to provide a lot more information - full schemas of all participating tables, and full list of indexes on each, for starters.
Then some idea of what the data is about, and what the query is intended to produce. Something on the order of a Use Case.
It'd be interesting to EXPLAIN PLAN with both to see what the differences were. I'm not sure if it's an apple and orange comparison, but I'd be curious.
I don't know if this can help, but this was the first hit on a search for "mysql query optimizer".
Here's another one that might be worthwhile.
The only open source database I know of that has CTEs is Firebird (http://www.firebirdsql.org/rlsnotesh/rlsnotes210.html#rnfb210-cte)
PostgreSQL will have them in 8.4, I think.