In Ab Initio, will a join take the same time to process matched records as unmatched ones?

In the Ab Initio ETL tool, suppose there is an inner join component in which a million records are matched on some keys, the join itself applies some simple transformation logic, and rejected records are collected for further processing.
Will the join take the same time to process matched records as unmatched ones?

In general, as with everything: run a test and see.
If you think about it, matched records should process faster than unmatched ones. The reason is that once a record is joined, it can be discarded and will not be present in the next iteration of trying to find a match, resulting in less computational work.

Related

Why does selecting/updating on an inner join take so much longer on higher IDs, and is there a way around it?

I have two tables, one with ~250k records and one with ~1.5M records.
I am doing a Select and/or Update using an Inner Join.
It hung, so I decided to do it in batches, e.g.
Where ID<100000
then
Where ID>=100000 and ID<200000
With each successive batch it takes longer and longer. Adding to the issue, of course, is that there might be records that would have joined had they not been separated into different batches.
I'm curious why that is (why do the first 100k IDs join fast and the last 100k take so much longer), and whether there is a way to do it as a single query without the IDs becoming the issue (if they in fact are).
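For reference, the batched approach described above looks roughly like the following (SQL Server syntax; the table and column names are made up for illustration, since the question does not show them):

-- hypothetical names: SmallTable (~250k rows), BigTable (~1.5M rows)
UPDATE s
SET s.SomeColumn = b.SomeColumn
FROM SmallTable s
INNER JOIN BigTable b ON b.SmallID = s.ID
WHERE s.ID >= 100000 AND s.ID < 200000;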

Continuation: why does this query produce a tremendously large result set when it should not?

Continuation
select a.artno, a.name
from Art a
inner join store s on a.Artno <> s.Artno
Running this query took more than a minute and produced over 899K rows, while it was supposed to return 7.9K results.
select artno
from art
except
(select artno from store)
This query gives me 7.9K rows, which is the correct result as far as I can tell.
The first query seems to work, but it takes a heck of a long time and produces a huge result set. Why?
It's generally NOT a good idea to use a <> operator with an INNER JOIN unless you really know that you want a lot of records. In other words, the JOIN is a great tool for inclusion, not exclusion.
When you do an INNER JOIN using a <> operator (especially on the keys), the query brings back every combination of art and store records except those where the Artno keys match.
So, if you have 4 art records and 5 store records, only one pair of which has a matching Artno value, you would end up with 4 x 5 - 1 = 19 records.
The second query simply displays all art records that aren't in any store.
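For comparison, the usual way to express exclusion is an anti-join, via NOT EXISTS or a LEFT JOIN filtered on NULL. Assuming Artno is non-nullable, both sketches below return the art records that have no match in store:

select a.artno, a.name
from Art a
where not exists (select 1 from store s where s.Artno = a.Artno)

-- or, equivalently
select a.artno, a.name
from Art a
left join store s on s.Artno = a.Artno
where s.Artno is null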
Well, the two queries are different. The first query is joining the results from both tables, and while the condition may look similar, based on your results it's a one-to-many relationship between the two tables.
In contrast, the second query is not joining the two results but rather selecting from the ART table and excluding the art numbers found in the other table.
Finally, the reason the first query is taking a lot longer is, without knowing a lot about your database, a guess, but I'm going to give it a shot.
The first bottleneck is that it's joining two tables that are clearly not one-to-one; the second bottleneck is probably indexing or the size of the left-hand table. Keep in mind that in a JOIN like the one you have, the left-hand table is scanned and, ideally, the right-hand table is accessed with a seek.
Does that make sense?

SQL Server OUTER JOIN multiple linked fields

I am trying to query data from two tables into one table using an OUTER JOIN. The thing is that to uniquely identify the rows, three fields are needed. This brings me to a query containing this expression:
FROM Data1 DB
RIGHT OUTER JOIN Data2 FT on (DB.field1 = FT.Value1
and DB.field2 = FT.field2
and DB.field3 = FT.field3)
However, the query runs pretty much forever. To test the whole thing I used WHERE conditions and a FULL OUTER JOIN: with the WHERE conditions it finishes almost instantly, whereas with the FULL OUTER JOIN I had the same trouble and usually ended up cancelling the whole thing after 5 minutes or so.
Can anyone see what I am doing wrong with my query? Thanks for any help!
Do you really need all the records back from the query? Some WHERE criteria could cut execution time down considerably.
Yes, and indexes. Check the plan and create the recommended indexes.
Your best bet is to view the execution plan (and if you are comfortable with it, post a screenshot of it in your question). That'll tell you where the most expensive portion of the query is happening.
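As a concrete sketch of the indexing advice, a composite index covering all three join columns on each side is the usual first step. The column names below are taken from the query in the question; the index names are made up, and whether these indexes actually help depends on what the plan shows:

CREATE INDEX IX_Data1_JoinKeys ON Data1 (field1, field2, field3);
CREATE INDEX IX_Data2_JoinKeys ON Data2 (Value1, field2, field3);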

What's better: joins or multiple sub-select statements as part of one query

Performance-wise, what is better?
Having 3 or 4 join statements in my query, or using embedded select statements to pull the same information from my database as part of one query?
I would say joins are better because:
They are easier to read.
You have more control over whether you want to do an inner, left/right outer, or full outer join.
Join statements cannot be abused so easily to create query abominations.
With joins it is easier for the query optimizer to create a fast query (if the inner select is simple, it might work out the same, but with more complicated stuff joins will work better).
Embedded selects can only simulate left/right outer joins.
Sometimes you cannot do the job with joins; in that case (and only then) you'll have to fall back on an inner select, as in the sketch below.
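A classic case where the inner select is the natural tool is comparing rows against an aggregate; the table and column names here are purely illustrative:

select e.name, e.salary
from Employee e
where e.salary > (select avg(salary) from Employee);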
It rather depends on your database: sizes of tables particularly, but also the memory parameters and sometimes even how the tables are indexed.
On older versions of MySQL, there was a real possibility of a query with a sub-select being considerably slower than a query that would return the same results structured with a join. (In the MySQL 4.1 days, I saw the difference be greater than an order of magnitude.) As a result, I prefer to build queries with joins.
That said, there are some types of queries that are extremely difficult to build with a join, and a sub-select is the only way to really do it.
Assuming the database engine does absolutely no optimization, I would say it depends on how consistent you need your data to be. If you're doing multiple SELECT statements on a busy database, where the data you are looking at may change rapidly, you may run into issues where your data does not match up between queries.
Assuming your data contains no inter-dependencies, then multiple queries will work fine. However, if your data requires consistency, use a single query.
This viewpoint boils down to keeping your data transactionally safe. Consider the situation where you have to pull a total of all accounts receivable, which is kept in a separate table from the monetary transaction amounts. If someone were to add another transaction in between your two queries, the accounts receivable total would not match the sum of the transaction amounts.
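As a minimal sketch of that scenario, with made-up table names, both figures can be read in one statement rather than two. Note that depending on the isolation level, an explicit transaction may still be needed for strict consistency:

-- hypothetical schema: AccountsReceivable holds the running balances,
-- MonetaryTransaction holds the individual amounts
select
    (select sum(Balance) from AccountsReceivable) as ReceivableTotal,
    (select sum(Amount) from MonetaryTransaction) as TransactionSum;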
Most databases will optimize both queries below into the same plan, so whether you do:
select A.a1, B.b1 from A left outer join B on A.id = B.a_id
or
select A.a1, (select B.b1 from B where B.a_id = A.id) as b1 from A
It ends up being the same. However, in most cases for non-trivial queries you'd better stick with joins whenever you can, especially since some types of joins (such as an inner join) are not possible to achieve using sub-selects.

SQL SERVER 2008 JOIN hints

Recently, I was trying to optimise this query
UPDATE z
SET UserID = x.UserID
FROM Analytics z
INNER JOIN UserDetail x ON x.UserGUID = z.UserGUID
The estimated execution plan showed 57% on the Table Update and 40% on a Hash Match (Aggregate). I did some snooping around and came across the topic of JOIN hints. So I added a LOOP hint to my inner join and WA-ZHAM! The new execution plan shows 38% on the Table Update and 58% on an Index Seek.
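A LOOP-hinted version of the query above would look something like this (in T-SQL's join-hint syntax the algorithm keyword goes between INNER and JOIN):

UPDATE z
SET z.UserID = x.UserID
FROM Analytics z
INNER LOOP JOIN UserDetail x ON x.UserGUID = z.UserGUID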
So I was about to start applying LOOP hints to all my queries until prudence got the better of me. After some googling, I realised that JOIN hints are not very well covered in BOL. Therefore...
Can someone please tell me why applying LOOP hints to all my queries is a bad idea? I read somewhere that LOOP JOIN is the default JOIN method for the query optimiser, but I couldn't verify the validity of that statement.
When are JOIN hints used? When the sh*t hits the fan and ghost busters ain't in town?
What's the difference between LOOP, HASH and MERGE hints? BOL states that MERGE seems to be the slowest but what is the application of each hint?
Thanks for your time and help people!
I'm running SQL Server 2008 BTW. The statistics mentioned above are ESTIMATED execution plans.
Can someone please tell me why applying LOOP hints to all my queries is a bad idea? I read somewhere that LOOP JOIN is the default JOIN method for the query optimiser, but I couldn't verify the validity of that statement.
Because this robs the optimizer of the opportunity to consider other methods which can be more efficient.
When are JOIN hints used? When the sh*t hits the fan and ghost busters ain't in town?
When the data distribution (on which the optimizer bases its decisions) is severely skewed and the statistics are not able to represent it correctly.
What's the difference between LOOP, HASH and MERGE hints? BOL states that MERGE seems to be the slowest but what is the application of each hint?
These are different algorithms.
LOOP is nested loops: for each record from the outer table, the inner table is searched for matches (using an index if available). Fastest when only a tiny portion of the records from both tables satisfies the JOIN and WHERE conditions.
MERGE sorts both tables and traverses them in sort order, skipping unmatched records. Fastest for FULL JOINs and when both recordsets are already sorted (from previous sort operations or when an index access path is used).
HASH builds a hash table in temporary storage (memory or tempdb) from one of the tables and probes it for each record from the other one. Fastest when a large portion of the records from either table matches the JOIN and WHERE conditions.
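For reference, T-SQL lets you express these either as a join-level hint or as a query-level OPTION hint; the queries below are sketches reusing the tables from the question. Note that a join-level hint also forces the join order, which is one more reason to use hints sparingly:

-- join-level hint: forces the algorithm for that specific join
SELECT z.UserGUID
FROM Analytics z
INNER HASH JOIN UserDetail x ON x.UserGUID = z.UserGUID

-- query-level hint: constrains every join in the query
SELECT z.UserGUID
FROM Analytics z
INNER JOIN UserDetail x ON x.UserGUID = z.UserGUID
OPTION (MERGE JOIN)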
The estimated execution plan showed 57% on the Table Update and 40% on a Hash Match (Aggregate). I did some snooping around and came across the topic of JOIN hints. So I added a LOOP hint to my inner join and WA-ZHAM! The new execution plan shows 38% on the Table Update and 58% on an Index Seek.
Surely that means that your proposed plan is worse? Assuming the table update takes constant time, it is now being out-costed by the index activity.