I work on a database of merchant traders in the Atlantic at the end of the fifteenth century. I wrote this query to find the relationships between individuals named in the same contracts. It works, but it creates duplicate results in the sense that it will return, for example, both 'a,b,01-01-1500' and 'b,a,01-01-1500'. This is a problem for me, since I am not considering directionality at this point. I tried to simply filter the results, but that created a bigger problem: it left only one result per pair regardless of whether the association had happened on a different date or in a different contract (which, granted, was probably my fault; filtering is something I am still learning).
What would be a way to avoid this issue?
SELECT d1.gen_id as 'source',
d2.gen_id as 'target',
c.date as 'timestamp'
from (
Select * from deed_party1
) as d1
INNER JOIN
(
select * from deed_party1
) as d2
on d1.deed_id = d2.deed_id
inner join contracts as c
on d1.deed_id = c.id
where d1.gen_id != d2.gen_id AND c.date between '1500-01-01' AND '1509-12-31';
Thanks for reading and your help,
You simply want < rather than !=. But the subqueries are not necessary, so:
select d1.gen_id as source, d2.gen_id as target,
c.date as timestamp
from deed_party1 d1 join
deed_party1 d2
on d1.deed_id = d2.deed_id and
d1.gen_id < d2.gen_id join
contracts c
on d1.deed_id = c.id
where c.date between '1500-01-01' AND '1509-12-31';
Also:
Do not use subqueries unnecessarily. In most databases this just clutters the query. MySQL also has a tendency to materialize subqueries, so they can have a negative performance impact as well.
Do not use single quotes to delimit column names. Single quotes should only be used for string and date constants.
If you must escape column names, use backticks.
Do not choose names that are already used by SQL, such as timestamp. This is safe in MySQL, but just a bad practice.
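To illustrate those quoting rules on the query above (a minimal sketch; nothing here changes the results):
select d1.gen_id as source,    -- plain alias: no quotes needed
       c.date   as `timestamp` -- backticks, not single quotes, if you must escape a name
from deed_party1 d1
join contracts c on d1.deed_id = c.id
where c.date >= '1500-01-01';  -- single quotes belong to string and date constants only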
So, 2 (more so 3) questions: is my query just badly coded or badly thought out? (Be kind, I only just discovered CROSS APPLY and am relatively new.) And is CROSS APPLY even the best sort of join to be using here, or why is it slow?
So I have a database table (Test_Table) of around 66 million records. I then have a global temp table (##MyTempTable) which has one column called Ordr_nbr (nchar(13)). These are basically the order numbers I wish to find.
Test_Table has 4 columns (Ordr_nbr, destination, shelf_no, dte_bought).
This is my current query. It works exactly the way I want it to, but its performance seems quite slow.
select ##MyTempTable.Ordr_nbr, test_table1.destination, test_table1.shelf_no, test_table1.dte_bought
from ##MyTempTable
cross apply (
    select top 1 Test_Table.destination, Test_Table.shelf_no, Test_Table.dte_bought
    from Test_Table
    where ##MyTempTable.Ordr_nbr = Test_Table.Ordr_nbr
    order by Test_Table.dte_bought desc
) test_table1
If ##MyTempTable only has 17 orders to search for, it takes around 2 minutes. As you can see, I'm trying to get just the most recent dte_bought, or some max(dte_bought), for each order.
In terms of indexes, I ran the Database Engine Tuning Advisor and it says the table is optimized for this query, and I have all the relevant indexes created, such as a clustered index on Test_Table on dte_bought desc including Ordr_nbr, etc.
The execution plan shows an index scan (on the non-clustered index) and a key lookup (on the clustered index).
My desired end result is to return all the Ordr_nbrs in ##MyTempTable along with the destination, shelf_no, and dte_bought for each Ordr_nbr, but only the most recently bought ones.
Sorry if I explained this awfully; if there's any info you need, just ask. I'm not asking for a downright "give me code" answer, more for guidance, advice and learning. Thank you in advance.
UPDATE
I have now tried a sort of left join. It runs reasonably quicker, but is still not instant or very fast (about 30 seconds), and it also doesn't return just the most recent dte_bought. Any ideas? See below for the left join code.
select a.Ordr_nbr, b.destination, b.shelf_no, b.dte_bought
from ##MyTempTable a
left join Test_Table b
    on a.Ordr_nbr = b.Ordr_nbr
where b.destination is not null
UPDATE 2
Attempted another left join with a max dte_bought. It works, but only returns the Ordr_nbr; the other columns are NULL. Any suggestions?
select a.Ordr_nbr, b.destination, b.shelf_no, b.dte_bought
from ##MyTempTable a
left join (
    select * from Test_Table where dte_bought = (
        select max(dte_bought) from Test_Table)
) b on b.Ordr_nbr = a.Ordr_nbr
order by b.dte_bought asc
K.M
Instead of CROSS APPLY you can use an INNER JOIN with a subquery. Check the following query:
SELECT
    TempT.Ordr_nbr
    ,TestT.destination
    ,TestT.shelf_no
    ,TestT.dte_bought
FROM ##MyTempTable TempT
INNER JOIN (
    SELECT T.Ordr_nbr
        ,T.destination
        ,T.shelf_no
        ,T.dte_bought
        ,ROW_NUMBER() OVER (PARTITION BY T.Ordr_nbr ORDER BY T.dte_bought DESC) AS ID
    FROM Test_Table T
) TestT
ON TestT.ID = 1 AND TempT.Ordr_nbr = TestT.Ordr_nbr
There is a table called "basket_status" in the query below. For each record in basket_status, a count of yarn balls in the basket is being made from another table (yarn_ball_updates).
The basket_status table has 761 rows. The yarn_ball_updates table has 1,204,294 records. Running the query below takes about 30 seconds to 60 seconds (depending on how busy the server is) and returns 750 rows. Obviously my problem is doing a match against 1,204,294 records for all of the 761 basket_status records.
I tried making a view based on the query, but it offered no performance increase. I believe I read that views can't contain subqueries or complex joins.
What direction should I take to speed up this query? I've never made a MySQL scheduled task or anything, but it seems like the "basket_status" table should already have a "yarn_ball_count" column in it, with an automated process updating that extra count() column.
Thanks for any help or direction.
SELECT p.id, p.basket_name, p.high_quality, p.yarn_ball_count
FROM (
SELECT q.id, q.basket_name, q.high_quality,
CAST(SUM(IF (q.report_date = mxd.mxdate,1,0)) AS CHAR) yarn_ball_count
FROM (
SELECT bs.id, bs.basket_name, bs.high_quality,ybu.report_date
FROM yb.basket_status bs
JOIN yb.yarn_ball_updates ybu ON bs.basket_name = ybu.alpha_pmn
) q,
(SELECT MAX(ybu.report_date) mxdate FROM yb.yarn_ball_updates ybu) mxd
GROUP BY q.basket_name, q.high_quality ) p
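For reference, the kind of automated process mentioned above might look like the sketch below in MySQL. The event name, the schedule, and the proposed yarn_ball_count column on basket_status are all assumptions, and the event scheduler must be enabled (SET GLOBAL event_scheduler = ON):
-- Hypothetical scheduled refresh of a denormalized count column
CREATE EVENT yb.refresh_yarn_ball_count
ON SCHEDULE EVERY 1 HOUR
DO
  UPDATE yb.basket_status bs
  SET bs.yarn_ball_count = (
    SELECT COUNT(*)
    FROM yb.yarn_ball_updates ybu
    WHERE ybu.alpha_pmn = bs.basket_name
      AND ybu.report_date = (SELECT MAX(report_date) FROM yb.yarn_ball_updates)
  );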
I don't think you need nested queries for this. I'm not a MySQL developer but won't this work?
SELECT bs.id, bs.basket_name, bs.high_quality, count(*) yarn_ball_count
FROM yb.basket_status bs
JOIN yb.yarn_ball_updates ybu ON bs.basket_name = ybu.alpha_pmn
JOIN (SELECT MAX(ybu.report_date) mxdate FROM yb.yarn_ball_updates ybu) mxd ON ybu.report_date = mxd.mxdate
GROUP BY bs.basket_name, bs.high_quality
I have an issue with a particular left join slowing down an important query drastically. Using phpMyAdmin to test the query, it states that the query took 3.9 seconds to return 546 rows, though I had to wait 67 seconds to see the results, which struck me as odd.
There are two tables involved:
nsi (1,553 rows) and files (233,561 rows). All columns mentioned in this query are indexed individually, and there is also a compound index on filejobid, category, and isactive in the files table. Everything being compared is an integer as well.
The goal of this watered-down version of the query is to display each row from the nsi table once and be able to determine whether someone has uploaded a file in category 20 or not. There can be multiple files, but there should only be one row, hence the grouping.
The query:
SELECT
nsi.id AS id,
f.id AS filein
FROM nsi nsi
LEFT JOIN files f
ON f.filejobid=nsi.leadid AND f.category=20 AND f.isactive=1
WHERE nsi.isactive=1
GROUP BY nsi.id
The 67-second load time for this data is simply unacceptable for my application, and I'm at a loss as to how to optimize it any further. I've indexed and compound-indexed. Should I just be looking into a new, more roundabout solution instead?
Thank you for any help!
This is your query, which I find a bit suspicious, because you have an aggregation but no aggregation function on f.id. It will return an arbitrary matching id:
SELECT nsi.id AS id, f.id AS filein
FROM nsi LEFT JOIN
files f
ON f.filejobid = nsi.leadid AND f.category = 20 AND f.isactive = 1
WHERE nsi.isactive = 1
GROUP BY nsi.id;
For this query, I think the best indexes are files(filejobid, category, isactive) and nsi(isactive, leadid, id).
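Spelled out as DDL, that would be something like (the index names are illustrative):
CREATE INDEX files_job_cat_active ON files (filejobid, category, isactive);
CREATE INDEX nsi_active_lead_id ON nsi (isactive, leadid, id);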
However you can easily rewrite the query to be more efficient, because it doesn't need the group by (assuming nsi.id is unique):
SELECT nsi.id AS id,
(SELECT f.id
FROM files f
WHERE f.filejobid = nsi.leadid AND f.category = 20 AND f.isactive = 1
LIMIT 1
) AS filein
FROM nsi
WHERE nsi.isactive = 1;
The same indexes would work for this.
If you want a list of matching files, rather than just one, then use group_concat() in either query.
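For instance, a minimal sketch of the group_concat() variant of the second query, using the same names as above (the fileids alias is illustrative):
SELECT nsi.id AS id,
       (SELECT GROUP_CONCAT(f.id)
        FROM files f
        WHERE f.filejobid = nsi.leadid AND f.category = 20 AND f.isactive = 1
       ) AS fileids
FROM nsi
WHERE nsi.isactive = 1;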
I know that I can join 2-3 small tables easily by writing simple joins. However, these joins can become very slow when you have 7-8 tables with 20 million+ rows, joining on 1-3 columns,
even when you have the right indices. Moreover, the query becomes long and ugly too.
Is there an alternative strategy for doing such big joins, preferably database agnostic?
EDIT
Here is pseudocode for the join. Note that some tables may have to be unpivoted before they are used in the join -
select * from
(select c1,c2,c3... From t1 where) as s1
inner join
(select c1,... From t2 where) as s2
inner join
(unpivot table to get c1,c2... From t3 where) as s3
inner join
(select c1,c2,c3... From t2 where) as s4
on
(s1.c1 = s2.c1)
and
(s1.c1 = s3.c1 and s1.c2 = s3.c2)
and
(s1.c1 = s4.c1 and s2.c2 = s4.c2 and s1.c3 = s4.c3)
Clearly, this is complicated and ugly. Is there a way to get the same result set in a much neater way without using such a complex join?
"7-8 tables" doesn't sound worrying at all. Modern RDBMS can handle a lot more.
Your pseudo-code query can be radically simplified to this form:
SELECT a.c1 AS a_c1, a.c2 AS a_c2, ... -- use column aliases ...
,b.c1, b.c2, ... -- .. If you really have same names more than once
,c.c1, c.c2, ...
,d.c1, d.c2, ...
FROM t1 a
JOIN t2 b USING (c1)
JOIN (unpivot table to get c1,c2... From t3 where) c USING (c1,c2)
JOIN t2 d ON d.c1 = a.c1 AND d.c2 = b.c2 AND d.c3 = a.c3
WHERE <some condition on a>
AND <more conditions> ..
As long as matching column names are unambiguous in the tables left of a JOIN, the USING syntax shortens the code. If anything can be ambiguous, use the explicit form demonstrated in my last join condition. That's all standard SQL, but according to this Wikipedia page:
The USING clause is not supported by MS SQL Server and Sybase.
It wouldn't make sense to use all those subqueries in your pseudo-code in most RDBMS. The query planner finds the best way to apply conditions and fetch columns itself. Smart query planners also rearrange tables in any order they see fit to arrive at a fast query plan.
Also, that thing called "database agnostic" only exists in theory. None of the major RDBMS completely implements the SQL standard and all of them have different weaknesses and strengths. You have to optimize for your RDBMS or get mediocre performance at best.
Indexing strategies are very important. 20 million rows doesn't matter much in a SELECT, as long as we can pluck a handful of row pointers from an index. Indexing strategies heavily depend on your brand of RDBMS. Columns that:
you JOIN on,
have WHERE conditions on,
or are used in ORDER BY
may benefit from an index.
There are also various types of indexes, designed for various requirements: B-tree, GIN, GiST, etc.; partial, multicolumn, functional, and covering indexes; various operator classes. To optimize performance, you just need to know the basics and the capabilities of your RDBMS.
The excellent PostgreSQL manual on indexes will give you an overview.
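For instance, in PostgreSQL syntax, with placeholder names from the pseudocode above (the index names and the partial-index predicate are purely illustrative):
-- Multicolumn B-tree index covering the join columns:
CREATE INDEX t1_c1_c2_c3_idx ON t1 (c1, c2, c3);

-- Partial index, useful when queries always apply the same predicate:
CREATE INDEX t2_c1_partial_idx ON t2 (c1) WHERE c2 IS NOT NULL;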
I have seen three ways of handling this if indexing fails to give a big enough performance boost.
The first is to use temp tables. The more joins the database performs, the worse the estimated row counts get, which can really slow down your query. If you run the joins and WHERE clauses that return the smallest number of rows first, and store the intermediate results in a temp table to give the cardinality estimator an accurate count, performance can improve significantly (see the sketch after these three options). This solution is the only one that doesn't create new database objects.
The second solution is a data warehouse, or at least an additional denormalized table or tables. In this case you would create an additional table to hold the final results of the query, or several tables that perform the major joins and hold intermediate results. As an example, if you had a customers table and three other tables that hold information about a customer, you could create a new table that holds the result of joining those four tables. This solution generally works when you are using the query for reports and can load the report table(s) each night with the new data generated during the day. It will be faster than the first, but is harder to implement and to keep current.
The third solution is a materialized view / indexed view. This solution depends heavily on the DB platform you use. Oracle and SQL Server both have a way to create a view and then index it, giving you greater performance on the view. This can come at the cost of not having current records, or greater storage cost for the view results, but it can help.
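A minimal sketch of the first (temp-table) approach, in SQL Server syntax; the table names, column names, and filter are placeholders from the pseudocode above:
-- Materialize the most selective join first; the optimizer then has
-- an exact row count for the remaining joins.
SELECT t1.c1, t1.c2, t1.c3
INTO #narrow
FROM t1
JOIN t2 ON t2.c1 = t1.c1
WHERE t1.c3 = 42;   -- hypothetical highly selective predicate

SELECT n.c1, n.c2, n.c3
FROM #narrow n
JOIN t3 ON t3.c1 = n.c1 AND t3.c2 = n.c2;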
Create materialized views and refresh them overnight, or refresh them only when considered necessary. For example, you can have two views: one materialized with old data that is never going to change, and another normal view with current data, and then a union between them. You could have more views like these for any output you need.
If your database engine doesn't support materialized views, just denormalize the old data into another table overnight.
Check this also: Refresh a Complex Materialized View
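A minimal sketch of that union approach, in Oracle syntax; the table t1, the cutoff column created_at, and the cutoff date are all assumptions:
-- Materialize the history that never changes (refresh on demand or on a schedule):
CREATE MATERIALIZED VIEW old_data_mv AS
SELECT c1, c2, c3 FROM t1 WHERE created_at < DATE '2020-01-01';

-- Normal view over current data, unioned with the materialized history:
CREATE VIEW combined_v AS
SELECT c1, c2, c3 FROM old_data_mv
UNION ALL
SELECT c1, c2, c3 FROM t1 WHERE created_at >= DATE '2020-01-01';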
I've been in the same situation before, and my strategy was to use a WITH clause.
See more here.
WITH
-- group some tables into a "temporary" view called MY_TABLE_A
MY_TABLE_A AS
(
SELECT T1.FIELD1, T2.FIELD2, T3.FIELD3
FROM T1
JOIN T2 ON T2.PKEY = T1.FKEY
JOIN T3 ON T3.PKEY = T2.FKEY
),
-- group some tables into another "temporary" view called MY_TABLE_B
MY_TABLE_B AS
(
SELECT T4.FIELD1, T5.FIELD2, T6.FIELD3
FROM T4
JOIN T5 ON T5.PKEY = T4.FKEY
JOIN T6 ON T6.PKEY = T5.FKEY
)
-- use those views
SELECT A.FIELD2, B.FIELD3
FROM MY_TABLE_A A
JOIN MY_TABLE_B B ON B.FIELD1 = A.FIELD1
WHERE A.FIELD3 = 'X'
AND B.FIELD2 = 'Y'
;
If you want to know whether there is another way to access the data: one approach is to take an interest in the object concept. On Oracle, at any rate, it works very well and simplifies development.
But it requires a business-object approach.
From your example we can use two concepts:
References
Inheritance
These can ease the readability of a query and sometimes its speed.
1 : References
A reference is a pointer to an object. It allows the removal of joins between tables, since rows are pointed to directly.
Here is a simple example:
CREATE TYPE S7 AS OBJECT (
id NUMBER(11)
, code NUMBER(11)
, label2 VARCHAR2(1024 CHAR)
);
CREATE TABLE S7_tbl OF S7 (
CONSTRAINT s7_k PRIMARY KEY(id)
);
CREATE TABLE S8 (
info VARCHAR2(500 CHAR)
, info2 NUMBER(5)
, ref_s7 REF S7 -- creation of the reference
);
We insert some data into both tables:
INSERT INTO S7_tbl VALUES ( S7 (1,1111, 'test'));
INSERT INTO S7_tbl VALUES ( S7 (2,2222, 'test2'));
INSERT INTO S7_tbl VALUES ( S7 (3,3333, 'test3'));
--
INSERT INTO S8 VALUES ('text', 22, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text2', 44, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 1111));
INSERT INTO S8 VALUES ('text3', 11, (SELECT REF(s) FROM S7_TBL s WHERE s.code = 2222));
And the SELECT:
SELECT s8.info, s8.info2 FROM S8 s8 WHERE s8.ref_s7.code = 1111;
Returns:
text2 | 44
text | 22
This is a kind of implicit join.
2 : Inheritance
CREATE TYPE S6 AS OBJECT (
name VARCHAR2(255 CHAR)
, date_start DATE
)
/
DROP TYPE S1;
CREATE TYPE S1 AS OBJECT(
data1 NUMBER(11)
, data2 VARCHAR(255 CHAR)
, data3 VARCHAR(255 CHAR)
) INSTANTIABLE NOT FINAL
/
CREATE TYPE S2 UNDER S1 (
dummy1 VARCHAR2(1024 CHAR)
, dummy2 NUMBER(11)
, dummy3 NUMBER(11)
, info_s6 S6
) INSTANTIABLE FINAL
/
CREATE TABLE S5
(
info1 VARCHAR2(128 CHAR)
, info2 NUMBER(6)
, object_s2 S2
)
We just insert a row into the table:
INSERT INTO S5
VALUES (
'info'
, 2
, S2(
1 -- fill data1
, 'xxx' -- fill data2
, 'yyy' -- fill data3
, 'zzz' -- fill dummy1
, 2 -- fill dummy2
, 4 -- fill dummy3
, S6(
'example1'
,SYSDATE
)
)
);
And the SELECT:
SELECT
    s.info1
    , s.object_s2.data1
    , s.object_s2.dummy1
    , s.object_s2.info_s6.name
FROM S5 s;
We can see that with this method we can easily access related data without writing explicit joins.
Hope this helps.
If it's all subqueries, you can push the matching conditions into the subqueries themselves, but a plain derived table cannot reference a sibling derived table such as s1; that requires a lateral join (LATERAL in standard SQL and PostgreSQL, CROSS APPLY in SQL Server). With that change, and so long as all the tables share the matching columns c1, c2, c3, it should be as simple as:
select * from
(select c1,c2,c3... from t1) as s1
cross join lateral
(select c1,... from t2 where t2.c1 = s1.c1) as s2
cross join lateral
(unpivot table to get c1,c2... from t3 where c2 = s2.c2) as s3
cross join lateral
(select c1,c2,c3... from t2 where c3 = s3.c3) as s4
You can make use of views and functions. Views make SQL code elegant, easy to read, and composable. Functions can return single values or rowsets, permitting fine-tuning of the underlying code for efficiency. Finally, filtering at subquery level instead of joining and then filtering at query level lets the engine produce smaller sets of data to join later, where indexes matter less because the amount of data to join is small and can be computed efficiently on the fly. A query like the one below can involve dozens of tables and complex business logic hidden in views and functions, and still be very efficient.
SELECT a.*, b.*
FROM (SELECT * FROM ComplexView
WHERE <filter that limits output to a few rows>) a
JOIN (SELECT x, y, z FROM AlreadySignificantlyFilteredView
WHERE x IN (SELECT f_XValuesForDate(CURRENT_DATE))) b
ON (a.x = b.x AND a.y = b.y AND a.z <= b.z)
WHERE <condition for filtering even further>
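For completeness, a sketch of the kind of rowset-returning function the query above assumes, in PostgreSQL syntax; the function name comes from the query, but its body, the calendar_values table, and the valid_on column are assumptions:
-- Hypothetical set-returning function: the x values valid on a given date
CREATE FUNCTION f_XValuesForDate(d date) RETURNS SETOF integer AS $$
    SELECT x FROM calendar_values WHERE valid_on = d;
$$ LANGUAGE sql STABLE;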