I know that I can join 2-3 small tables easily by writing simple joins. However, these joins can become very slow when you have 7-8 tables with 20 million+ rows, joining on 1-3 columns,
even when you have the right indices. Moreover, the query becomes long and ugly too.
Is there an alternative strategy for doing such big joins, preferably database agnostic?
EDIT
Here is pseudocode for the join. Note that some tables may have to be unpivoted before they are used in the join -
select * from
(select c1,c2,c3... From t1 where) as s1
inner join
(select c1,... From t2 where) as s2
inner join
(unpivot table to get c1,c2... From t3 where) as s3
inner join
(select c1,c2,c3... From t2 where) as s4
on
(s1.c1 = s2.c1)
and
(s1.c1 = s3.c1 and s1.c2 = s3.c2)
and
(s1.c1 = s4.c1 and s2.c2 = s4.c2 and s1.c3 = s4.c3)
Clearly, this is complicated and ugly. Is there a way to get the same result set in a much neater way without using such a complex join?
"7-8 tables" doesn't sound worrying at all. Modern RDBMS can handle a lot more.
Your pseudo-code query can be radically simplified to this form:
SELECT a.c1 AS a_c1, a.c2 AS a_c2, ... -- use column aliases ...
,b.c1, b.c2, ... -- .. If you really have same names more than once
,c.c1, c.c2, ...
,d.c1, d.c2, ...
FROM t1 a
JOIN t2 b USING (c1)
JOIN (unpivot table to get c1,c2... From t3 where) c USING (c1,c2)
JOIN t2 d ON d.c1 = a.c1 AND d.c2 = b.c2 AND d.c3 = d.c3
WHERE <some condition on a>
AND <more conditions> ..
As long as matching column names are unambiguous in the tables left of a JOIN, the USING syntax shortens the code. If anything can be ambiguous, use the explicit form demonstrated in my last join condition. That's all standard SQL, but according to this Wikipedia page:
The USING clause is not supported by MS SQL Server and Sybase.
It wouldn't make sense to use all those subqueries in your pseudo-code in most RDBMS. The query planner finds the best way to apply conditions and fetch columns itself. Smart query planners also rearrange tables in any order they see fit to arrive at a fast query plan.
Also, that thing called "database agnostic" only exists in theory. None of the major RDBMS completely implements the SQL standard and all of them have different weaknesses and strengths. You have to optimize for your RDBMS or get mediocre performance at best.
Indexing strategies are very important. 20 million rows doesn't matter much in a SELECT, as long as we can plug a hand full of row pointers from an index. Indexing strategies heavily depend on your brand of RDBMS. Columns:
You JOIN on,
Have WHERE conditions on,
Or are used in ORDER BY
May benefit from an index.
There are also various types of indexes, designed for various requirements. B-tree, GIN, GiST, . Partial, multicolumn, functional, covering. Various operator classes. To optimize performance you just need to know the basics and the capabilities of your RDBMS.
The excellent PostgreSQL manual on indexes to give you an overview.
I have seen three ways of handling this if indexing fails to give a big enough performance boost.
The first is to use temp tables. The more joins the database performed, the wores the estimated rows gets which can really slow down your query. If you run your joins and where clauses that will return the smallest number of rows, and store the intermediate results in a temp table to allow the cardinality estimator a more accurate count performance can improve significantly. This solution is the only once that doesn't create new database objects.
The second solution is a database warehouse, or at least an additional denormalized table(s). In this case you would create an additional table to hold the final results of the query, or several tables that perform the major joins and hold intermediate results. As an example, if you had a customers table, and three other tables that hold information about a customer, you could create a new table that holds the result of joining thsoe four tables. This solution generally works when you are using this query for reports and you can load the report table(s) each night with the new data generated during the day. This solution will be faster than the first, but is harder to implement and keep the results current.
The third solution is a materilized view/ indexed view. This solution depends heavily on the db platform you use. Oracle and Sql Server both have a way to create a view and then index it, giving you greater performance on the view. This can come at the cost of not having current records or greater data cost to store the view results but it can help.
Create materialized views and refresh them over the night. Or refresh them only when consider necessary. For example, you can have 2 views, one materialized with old data that is not going to be ever changed, and another normal view with actual data. And then a union between these. So you could have more views like these for any output you need.
If your database engine doesn't support materialized views, just denormalize the old data in another table over the night.
Check this also: Refresh a Complex Materialized View
I've been in same situation before, and my strategy was use WITH clause.
See more here.
WITH
-- group some tables into a "temporary" view called MY_TABLE_A
MY_TABLE_A AS
(
SELECT T1.FIELD1, T2.FIELD2, T3.FIELD3
FROM T1
JOIN T2 ON T2.PKEY = T1.FKEY
JOIN T3 ON T3.PKEY = T2.FKEY
),
-- group some tables into another "temporary" view called MY_TABLE_B
MY_TABLE_B AS
(
SELECT T4.FIELD1, T5.FIELD2, T6.FIELD3
FROM T4
JOIN T5 ON T5.PKEY = T4.FKEY
JOIN T6 ON T6.PKEY = T5.FKEY
)
-- use those views
SELECT A.FIELD2, B.FIELD3
FROM MY_TABLE_A A
JOIN MY_TABLE_B B ON B.FIELD1 = A.FIELD1
WHERE A.FIELD3 = "X"
AND B.FIELD2 = "Y"
;
If you want to know if there is another way to access the data. One approach would be to take an interest in the object concept. I any event on Oracle. it's works very well and simplify dev.
But it requires a business object approach.
From your example we can use two concept :
Reference
Inherence
Who can ease the readability of a query and sometimes speed.
1 : References
A reference is a pointer to an object. It allows the removal of joins between tables as they will be pointed.
Here is a simple Exemple :
CREATE TYPE S7 AS OBJECT (
id NUMBER(11)
, code NUMBER(11)
, label2 VARCHAR2(1024 CHAR)
);
CREATE TABLE S7_tbl OF S7 (
CONSTRAINT s7_k PRIMARY KEY(id)
);
CREATE TABLE S8 (
info VARCHAR2(500 CHAR)
, info2 NUMBER(5)
, ref_s7 REF S7 -- creation of the reference
);
We insert some datas in both table :
INSERT INTO S7_tbl VALUES ( S7 (1,1111, 'test'));
INSERT INTO S7_tbl VALUES ( S7 (2,2222, 'test2'));
INSERT INTO S7_tbl VALUES ( S7 (3,3333, 'test3'));
--
INSERT INTO S8 VALUES ('text', 22, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text2', 44, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text3', 11, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 2222));
And the SELECT :
SELECT s8.info, s8.info2 FROM S8 s8 WHERE s8.ref_s7.code = 1111;
RETURN :
text2 | 44
text | 22
Here is a type of implicit join
2 : inherence
CREATE TYPE S6 AS OBJECT (
name VARCHAR2(255 CHAR)
, date_start DATE
)
/
DROP TYPE S1;;
CREATE TYPE S1 AS OBJECT(
data1 NUMBER(11)
, data2 VARCHAR(255 CHAR)
, data3 VARCHAR(255 CHAR)
) INSTANTIABLE NOT FINAL
/
CREATE TYPE S2 UNDER S1 (
dummy1 VARCHAR2(1024 CHAR)
, dummy2 NUMBER(11)
, dummy3 NUMBER(11)
, info_s6 S6
) INSTANTIABLE FINAL
/
CREATE TABLE S5
(
info1 VARCHAR2(128 CHAR)
, info2 NUMBER(6)
, object_s2 S2
)
We just insert a row in the table
INSERT INTO S5
VALUES (
'info'
, 2
, S2(
1 -- fill data1
, 'xxx' -- fill data2
, 'yyy' -- fill data3
, 'zzz' -- fill dummy1
, 2 -- fill dummy2
, 4 -- fill dummy3
, S6(
'example1'
,SYSDATE
)
)
);
And the SELECT :
SELECT
s.info1
, s.objet_s2.data1
,s.objet_s2.dummy1
,s.objet_s2.info_s6.name
FROM S5 s;
We can see that by this method we can easily access related data without using.
hoping that it can serve you
if it's all subqueries you can do it in the sub queries for each and as all the matching data happens it should be as simple as below so long as all the tables c1,c2,c3
select * from
(select c1,c2,c3... from t1) as s1
inner join
(select c1,... from t2 where c1 = s1.c1) as s2
inner join
(unpivot table to get c1,c2... from t3 where c2 = s2.c2) as s3
inner join
(select c1,c2,c3... from t2 where c3 = s3.c3) as s4
You can make use of views and functions. Views make SQL code elegant and easy to read and compose. Functions can return single values or rowsets permitting fine-tuning the underlying code for efficiency. Finally, filtering at subquery level instead of joining and filtering at query level permits the engine produce smaller sets of data to join later, where indices are not that significant since the amount of data to join is small and can be efficiently computed on the fly. Something like the query below can be include highly complex queries involving dozens of tables and complex business logic hidden in views and functions, and still be very efficient.
SELECT a.*, b.*
FROM (SELECT * FROM ComplexView
WHERE <filter that limits output to a few rows>) a
JOIN (SELECT x, y, z FROM AlreadySignificantlyFilteredView
WHERE x IN (SELECT f_XValuesForDate(CURRENT_DATE))) b
ON (a.x = b.x AND a.y = b.y AND a.z <= b.z)
WHERE <condition for filtering even further>
I have a query which goes like this:
SELECT insanlyBigTable.description_short,
insanlyBigTable.id AS insanlyBigTable,
insanlyBigTable.type AS insanlyBigTableLol,
catalogpartner.id AS catalogpartner_id
FROM insanlyBigTable
INNER JOIN smallerTable ON smallerTable.id = insanlyBigTable.catalog_id
INNER JOIN smallerTable1 ON smallerTable1.catalog_id = smallerTable.id
AND smallerTable1.buyer_id = 'xxx'
WHERE smallerTable1.cont = 'Y' AND insanlyBigTable.type IN ('111','222','33')
GROUP BY smallerTable.id;
Now, when I run the query first time it copies the giant table into a temp table... I want to know how I can prevent that? I am considering a nested query, or even to reverse the join (not sure the effect would be to run faster), but that is well, not nice. Any other suggestions?
To figure out how to optimize your query, we first have to boil down exactly what it is selecting so that we can preserve that information while we change things around.
What your query does
So, it looks like we need the following
The GROUP BY clause limits the results to at most one row per catalog_id
smallerTable1.cont = 'Y', insanelyBigTable.type IN ('111','222','33'), and buyer_id = 'xxx' appear to be the filters on the query.
And we want data from insanlyBigTable and ... catalogpartner? I would guess that catalogpartner is smallerTable1, due to the id of smallerTable being linked to the catalog_id of the other tables.
I'm not sure on what the purpose of including the buyer_id filter on the ON clause was for, but unless you tell me differently, I'll assume the fact it is on the ON clause is unimportant.
The point of the query
I am unsure about the intent of the query, based on that GROUP BY statement. You will obtain just one row per catalog_id in the insanelyBigTable, but you don't appear to care which row it is. Indeed, the fact that you can run this query at all is due to a special non-standard feature in MySQL that lets you SELECT columns that do not appear in the GROUP BY statement... however, you don't get to select WHICH columns. This means you could have information from 4 different rows for each of your selected items.
My best guess, based on column names, is that you are trying to bring back a list of items that are in the same catalog as something that was purchased by a given buyer, but without any more than one item per catalog. In addition, you want something to connect back to the purchased item in that catalog, via the catalogpartner table's id.
So, something probably akin to amazon's "You may like these items because you purchased these other items" feature.
The new query
We want 1 row per insanlyBigTable.catalog_id, based on which catalog_id exists in smallerTable1, after filtering.
SELECT
ibt.description_short,
ibt.id AS insanlyBigTable,
ibt.type AS insanlyBigTableLol,
(
SELECT smallerTable1.id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND st.catalog_id = ibt.catalog_id
LIMIT 1
) AS catalogpartner_id
FROM insanlyBigTable ibt
WHERE ibt.id IN (
SELECT (
SELECT ibt.id AS ibt_id
FROM insanlyBigTable ibt
WHERE ibt.catalog_id = sti.catalog_id
LIMIT 1
) AS ibt_id
FROM (
SELECT DISTINCT(catalog_id) FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND EXISTS (
SELECT * FROM insanlyBigTable ibt
WHERE ibt.type IN ('111','222','33')
AND ibt.catalog_id = st.catalog_id
)
) AS sti
)
This query should generate the same result as your original query, but it breaks things down into smaller queries to avoid the use (and abuse) of the GROUP BY clause on the insanlyBigTable.
Give it a try and let me know if you run into problems.
I am having a pretty complicated query. What I have is a table with car parts (parts_list, about 600 rows) which contains some information about the part like it's name and if it's motor dependent or not. For motor dependency I am having two different queries, so this is the one with no motor dependency (0, I save it as a boolean). Most of the parts can be disassembled and broken into more parts, that's why I am saving the parts as a tree and in that query I take only the parts that can't be disassembled (tree leaves). This table represents only the list of possible parts. Now for each car model I save a row in another table (parts) and I stick the parts_list_id and model_id together and then the price and quantity. Now if I run the query it will successfully generate about 500 rows(taking only the leaf parts) in the table "parts" and it will do what I need. About 500 (leaf) parts for the model id. But sometimes I generate a row for another model for a specific part. And then the query doesn't make the rest 499 rows. It only works if WHERE NOT EXISTS select query gives back 0. If even one row exists it doesn't insert the rest. But it doesn't make sense to me, because shouldn't it check with different values like a loop?
INSERT INTO parts (parts_list_id, model_id, motor_id)
SELECT orig1.id, '" . $this->model_id . "', '0'
FROM parts_list AS orig1
LEFT JOIN parts_list AS orig2 ON ( orig1.id = orig2.parent_id )
WHERE orig2.id IS NULL
AND orig1.motor_dependent = '0'
AND NOT EXISTS (
SELECT t1.id
FROM parts_list AS t1
LEFT JOIN parts_list AS t2 ON ( t1.id = t2.parent_id )
LEFT JOIN parts ON ( parts.parts_list_id = t1.id )
WHERE t2.id IS NULL
AND t1.motor_dependent = '0'
AND parts.parts_list_id = t1.id
AND parts.model_id = :model_id
)
Well, the sql statement seems fine. If there is one row, NOT EXISTS returns false. And that is correct. No one row is inserted. Perhaps you want to put some different checks using parts_list_id NOT IN (your subquery) instead of NOT EXISTS.
NOT EXISTS express a condition for the set as a whole, NOT IN is used to determine if the set has the right items.
I hope to get it right. It is a little bit difficult to understand your domain just from a single statement.
This is a theoretical scenario, and I am more than amateur when it comes to large scale SQL databases...
How would I go about inserting around 2million records into an existing database off 6million records (table1 into table2), whilst at the same time using email de-duplication (some subscribers may already exist in site2, but we don't want to insert those that already exist)?
I understand how to simply get the records from site 1 and add them into site 2, but how would we do this on such a large scale, and not causing data duplication? Any reading sources would be more than helpful for me, as ive found that a struggle.
i.e.:
Table 1: site1Subscribers
site1Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)
Table 2: site2Subscribers
site2Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)
I would try something like this:
insert into site2Subscribers
select * from site1Subscribers s1
left outer join site2Subscribers s2
on s1.subEmail = s2.subEmail
where s2.subEmail is null;
The left outer join along with the null check will return only those rows from site1Subscribers that have no matching entry in site2Subscribers.
I have a table in my database to store user data. I found a defect in the code that adds data to this table database where if a network timeout occurs, the code updated the next user's data with the previous user's data. I've addressed this defect but I need to clean the database. I've added a flag to indicate the rows that need to be ignored and my goal is to mark these flags accordingly for duplicates. In some cases, though, duplicate values may actually be legitimate so I am more interested in finding several user's with the same data (i.e, u> 2).
Here's an example (tablename = Data):
id---- user_id----data1----data2----data3----datetime-----------flag
1-----usr1--------3---------- 2---------2---------2012-02-16..-----0
2-----usr2--------3---------- 2---------2---------2012-02-16..-----0
3-----usr3--------3---------- 2---------2---------2012-02-16..-----0
In this case, I'd like to mark the 1 and 2 id flags as 1 (to indicate ignore). Since we know usr1 was the original datapoint (assuming the oldest dates are earlier in the list).
At this point there are so many entries in the table that I'm not sure the best way to identify the users that have duplicate entries.
I'm looking for a mysql command to identify the problem data first and then I'll be able to mark the entries. Could someone guide me in the right direction?
Well, first select duplicate data with their min user id:
CREATE TEMPORARY TABLE duplicates
SELECT MIN(user_id), data1,data2,data3
FROM data
GROUP BY data1,data2,data3
HAVING COUNT(*) > 1 -- at least two rows
AND COUNT(*) = COUNT(DISTINCT user_id) -- all user_ids must be different
AND TIMESTAMPDIFF( MINUTE, MIN(`datetime`), MAX(`datetime`)) <= 45;
(I'm not sure, if I used TIMESTAMPDIFF properly.)
Now we can update the flag in those rows where user_id is different:
UPDATE duplicate
INNER JOIN data ON data.data1 = duplicate.data1
AND data.data2 = duplicate.data2
AND data.data3 = duplicate.data3
AND data.user_id != duplicate.user_id
SET data.flag = 1;
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
If there are duplicate times involved, maybe try this
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,,`datetime`,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3,`datetime`
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);