Finding updated records in SSIS -- to hash or not to hash?

I'm working on migrating data from a table in a DB2 database to our SQL Server database using SSIS. The table that I am pulling data from contains a respectable amount of data -- a little less than 100,000 records -- but it also has 46 columns.
I only want to update the rows that NEED to be updated, and so I came to the conclusion that I could either use a Lookup Transformation and check all 46 columns, redirecting the "no matches" to be updated on the SQL table; or I could hash each row in the datasets after I read the data in at the beginning of my data flow task, and then use those hash values as a comparison later on when determining whether the rows are equal.
My question would be: Which is the better route to take? I like hashing them, but I'm not sure if that is the best route to take. Does anyone have any pearls of wisdom they'd like to share?

Why not both?
Generally speaking, there are two things we look for when doing an incremental load: Does this exist? If it exists, has it changed? If there's a single column to check, it's trivial. When there are many columns to check, it becomes quite the pain, especially if you're using SSIS to map all those columns and/or have to worry about NULLs.
I solve the multicolumn problem by cheating: I create two columns in all my tables, HistoricalHashKey and ChangeHashKey. The historical hash key is built from all the business keys; the change hash key is built from the rest of the material columns (I'd exclude things like audit columns). We are not storing the concatenated values directly in our hash columns. Instead, we're going to math the stuff out of them and apply a hashing algorithm, SHA-1, which takes all the input columns and returns a 20-byte output.
There are three caveats to using this approach. You must concatenate the columns in the same order every time. These will be case sensitive. Trailing space is significant. That's it.
In your tables, you would add those two columns as binary(20) NOT NULL.
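For example, a minimal sketch against a dbo.DimProduct target like the one used below (the DEFAULT constraints are only there so the columns can be added as NOT NULL to a table that already contains rows):
ALTER TABLE dbo.DimProduct ADD
    HistoricalHashKey binary(20) NOT NULL
        CONSTRAINT DF_DimProduct_HistoricalHashKey DEFAULT (0x)
  , ChangeHashKey binary(20) NOT NULL
        CONSTRAINT DF_DimProduct_ChangeHashKey DEFAULT (0x);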
Set up
Your control flow would look something like this
and your data flow something like this
OLESRC Incremental Data
(Assume I'm sourcing from AdventureWorks2014, Production.Product.) I'm going to use the CONCAT function from SQL Server 2012+ as it promotes all data types to string and is NULL safe.
SELECT
P.ProductID
, P.Name
, P.ProductNumber
, P.MakeFlag
, P.FinishedGoodsFlag
, P.Color
, P.SafetyStockLevel
, P.ReorderPoint
, P.StandardCost
, P.ListPrice
, P.Size
, P.SizeUnitMeasureCode
, P.WeightUnitMeasureCode
, P.Weight
, P.DaysToManufacture
, P.ProductLine
, P.Class
, P.Style
, P.ProductSubcategoryID
, P.ProductModelID
, P.SellStartDate
, P.SellEndDate
, P.DiscontinuedDate
, P.rowguid
, P.ModifiedDate
-- Hash my business key(s)
, CONVERT(binary(20), HASHBYTES('SHA1',
CONCAT
(
-- Having an empty string as the first argument
-- allows me to simplify building of column list
''
, P.ProductID
)
)
) AS HistoricalHashKey
-- Hash the remaining columns
, CONVERT(binary(20), HASHBYTES('SHA1',
CONCAT
(
''
, P.Name
, P.ProductNumber
, P.MakeFlag
, P.FinishedGoodsFlag
, P.Color
, P.SafetyStockLevel
, P.ReorderPoint
, P.StandardCost
, P.ListPrice
, P.Size
, P.SizeUnitMeasureCode
, P.WeightUnitMeasureCode
, P.Weight
, P.DaysToManufacture
, P.ProductLine
, P.Class
, P.Style
, P.ProductSubcategoryID
, P.ProductModelID
, P.SellStartDate
, P.SellEndDate
, P.DiscontinuedDate
)
)
) AS ChangeHashKey
FROM
Production.Product AS P;
LKP Check Existence
This query will pull back the stored HistoricalHashKey and ChangeHashKey from our reference table.
SELECT
DP.HistoricalHashKey
, DP.ChangeHashKey
FROM
dbo.DimProduct AS DP;
At this point, it's a simple matter to compare the HistoricalHashKeys to determine whether the row exists. If we match, we want to pull back the ChangeHashKey into our Data Flow. By convention, I name this lkp_ChangeHashKey to differentiate from the source ChangeHashKey.
CSPL Change Detection
The conditional split is also simplified. Either the two Change Hash keys match (no change) or they don’t (changed). That expression would be
ChangeHashKey == lkp_ChangeHashKey
OLE_DST StagedUpdates
Rather than use the OLE DB Command, create a dedicated table to hold the rows that need to be updated. The OLE DB Command does not scale well because, behind the scenes, it issues singleton UPDATE statements.
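A minimal sketch of that staging table, assuming it mirrors whichever columns you intend to update plus the two hash keys (the types shown are guesses based on AdventureWorks):
CREATE TABLE Stage.DimProduct
(
    Name nvarchar(50) NOT NULL
    , ProductNumber nvarchar(25) NOT NULL
    -- ... every other column you plan to update ...
    , HistoricalHashKey binary(20) NOT NULL
    , ChangeHashKey binary(20) NOT NULL
);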
SQL Perform Set Based Updates
After the data flow is complete, all the data that needs updating will be in our staging table. This Execute SQL Task simply updates the existing data matching on our business keys.
UPDATE
TGT
SET
Name = SRC.Name
, ProductNumber = SRC.ProductNumber
FROM
dbo.DimProduct AS TGT
INNER JOIN
Stage.DimProduct AS SRC
ON SRC.HistoricalHashKey = TGT.HistoricalHashKey;
-- If clustered on a single column and table is large, this will yield better performance
-- ON SRC.DimProductSK = TGT.DimProductSK;
From the comments
Why do I use dedicated INSERT and UPDATE statements when we have the shiny MERGE? Besides not remembering the syntax as easily, the SQL Server implementation can have some ... unintended consequences. They may be cornerish cases, but I'd rather not run into them with the solutions I deliver. Explicit INSERT and UPDATE statements give me the fine-grained control I want and need in my solutions. I love SQL Server and think it's a fantastic product, but the weird syntax coupled with known bugs keeps me from using MERGE anywhere but a certification exam.
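For illustration only (in a package like the one above, brand-new rows typically go straight to the destination from the Lookup's no-match output), a companion explicit INSERT driven from a hypothetical Stage.DimProductNew staging table might look like this sketch, with the column list abbreviated:
INSERT INTO dbo.DimProduct
(Name, ProductNumber /* ... remaining columns ... */, HistoricalHashKey, ChangeHashKey)
SELECT
SRC.Name
, SRC.ProductNumber /* ... remaining columns ... */
, SRC.HistoricalHashKey
, SRC.ChangeHashKey
FROM
Stage.DimProductNew AS SRC
WHERE NOT EXISTS
(
SELECT 1
FROM dbo.DimProduct AS TGT
WHERE TGT.HistoricalHashKey = SRC.HistoricalHashKey
);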

Related

M expression for SSIS equivalent

I've just started exploring DAX in Power BI, so please excuse me if this query sounds too novice to expert users. DAX functions are too 'different' if you're coming from a SQL background, hence the question.
In SSIS I'm using a function to replace values in a column based on a string (more specifically, an error in the value). I'm using the below to do the job easily:
Column2 = SUBSTRING([Column1],1,FINDSTRING([Column1],";#",1) - 1)
Even after looking at the Text functions on the Microsoft help page for almost an hour trying to understand them, I couldn't work it out for some reason.
Any ideas?
An analogous expression in M would be
Text.Middle([Column1], 1, Text.PositionOf([Column1], ";#") - 1)
But you could also use Text.Start instead since you're starting at 1 or make it even simpler with Text.BeforeDelimiter:
Text.BeforeDelimiter([Column1], ";#")
In DAX, you'd use MID/LEFT instead of Text.Start/Text.Middle and FIND or SEARCH (depending on if you need case-sensitivity or not) instead of Text.PositionOf.
LEFT ( [Column1], SEARCH ( ";#", [Column1] ) - 1 )
Either way, the logic is nearly identical but you just have different function names.

Updating JSON in SQLite with JSON1

The SQLite JSON1 extension has some really neat capabilities. However, I have not been able to figure out how I can update or insert individual JSON attribute values.
Here is an example
CREATE TABLE keywords
(
id INTEGER PRIMARY KEY,
lang INTEGER NOT NULL,
kwd TEXT NOT NULL,
locs TEXT NOT NULL DEFAULT '{}'
);
CREATE INDEX kwd ON keywords(lang,kwd);
I am using this table to store keyword searches and to record, in the locs object, the locations from which the search was initiated. A sample entry in this table would look like the one shown below
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1}'
The location object attributes here are indices to the actual locations stored elsewhere.
Now imagine the following scenarios
A search for stackoverflow is initiated from location index "2". In this case I simply want to increment the value at that index so that after the operation the corresponding row reads
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":2,"5":1}'
A search for stackoverflow is initiated from a previously unknown location index "7" in which case the corresponding row after the update would have to read
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1,"7":1}'
It is not clear to me that this can in fact be done. I tried something along the lines of
UPDATE keywords json_set(locs,'$.2','2') WHERE kwd = 'stackoverflow';
which gave the error message error near json_set. I'd be most obliged to anyone who might be able to tell me how/whether this should/can be done.
It is not necessary to create such complicated SQL with subqueries to do this.
The SQL below would solve your needs.
UPDATE keywords
SET locs = json_set(locs,'$.7', IFNULL(json_extract(locs, '$.7'), 0) + 1)
WHERE kwd = 'stackoverflow';
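The same pattern covers both scenarios in the question; for example, a search from the already-known location index "2" becomes:
-- Increments "2" if it already exists, otherwise creates it with a value of 1
UPDATE keywords
SET locs = json_set(locs, '$.2', IFNULL(json_extract(locs, '$.2'), 0) + 1)
WHERE kwd = 'stackoverflow';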
I know this is old, but it's one of the first links that comes up when searching, and it deserves a better solution.
I could have just deleted this question but given that the SQLite JSON1 extension appears to be relatively poorly understood I felt it would be more useful to provide an answer here for the benefit of others. What I have set out to do here is possible but the SQL syntax is rather more convoluted.
UPDATE keywords SET locs =
(SELECT json_set(json(keywords.locs), '$.N',
IFNULL(
(SELECT json_extract(keywords.locs, '$.N') FROM keywords WHERE id = '1'),
0)
+ 1)
FROM keywords WHERE id = '1')
WHERE id = '1';
(where N is the relevant location index) will accomplish both of the updates I described in my original question above. Given how complicated this looks, a few explanations are in order:
The UPDATE keywords part does the actual updating, but it needs to know what to update.
The SELECT json_set part is where we establish the value to be updated.
If the relevant value does not exist in the first place we do not want to do a + 1 on a null value, so we do an IFNULL test.
The WHERE id = '1' parts ensure that we target the right row.
Having now worked with JSON1 in SQLite for a while, I have a tip to share with others going down the same road. It is easy to waste your time writing extremely convoluted and hard-to-maintain SQL in an effort to perform in-place JSON manipulation. Consider using SQLite temporary tables (CREATE TEMP TABLE ...) to store intermediate results and write a sequence of SQL statements instead. This makes the code a whole lot easier to understand and to maintain.
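As a rough sketch of that tip against the keywords table above (the staged table and column names are only illustrative):
-- Stage the new counts in a temporary table, then write them back in one pass
CREATE TEMP TABLE loc_counts AS
SELECT id, IFNULL(json_extract(locs, '$.7'), 0) + 1 AS new_count
FROM keywords
WHERE kwd = 'stackoverflow';

UPDATE keywords
SET locs = json_set(locs, '$.7',
    (SELECT new_count FROM loc_counts WHERE loc_counts.id = keywords.id))
WHERE id IN (SELECT id FROM loc_counts);

DROP TABLE loc_counts;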

Replacing existing View but MySQL says "Table doesn't exist"

I have a table in my MySQL database, compatibility_core_rules, which essentially stores pairs of ids which represent compatibility between parts which have fields with those corresponding ids. Now, my aim is to get all possible compatibility pairs by following the transitivity of the pairs (e.g. so if the table has (1,2) and (2,4), then add the pair (1,4)). So, mathematically speaking, I'm trying to find the transitive closure of the compatibility_core_rules table.
E.g. if compatibility_core_rules contains (1,2), (2,4) and (4,9), then initially we can see that (1,2) and (2,4) gives a new pair (1,4). I then iterate over the updated pairs and find that (4,9) with the newly added (1,4) gives me (1,9). At this point, iterating again would add no more pairs.
So my approach is to create a view with the initial pairs from compatibility_core_rules, like so:
CREATE VIEW compatibility_core_rules_closure
AS
SELECT part_type_field_values_id_a,
part_type_field_values_id_b,
custom_builder_id
FROM compatibility_core_rules;
Then, in order to iteratively discover all pairs, I need to keep replacing that view with an updated version of itself that has additional pairs each time. However, I found MySQL doesn't like me referencing the view in its own definition, so I make a temporary view (with or replace, since this will be inside a loop):
CREATE OR REPLACE VIEW compatibility_core_rules_closure_temp
AS
SELECT part_type_field_values_id_a,
part_type_field_values_id_b,
custom_builder_id
FROM compatibility_core_rules_closure;
No problems here. I then reference this temporary view in the following CREATE OR REPLACE VIEW statement to update the compatibility_core_rules_closure view with one iteration's worth of additional pairs:
CREATE OR REPLACE VIEW compatibility_core_rules_closure
AS
SELECT
CASE WHEN ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_a THEN ccr1.part_type_field_values_id_b
WHEN ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_b THEN ccr1.part_type_field_values_id_b
END ccrA,
CASE WHEN ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_a THEN ccr2.part_type_field_values_id_b
WHEN ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_b THEN ccr2.part_type_field_values_id_a
END ccrB,
ccr1.custom_builder_id custom_builder_id
FROM compatibility_core_rules_closure_temp ccr1
INNER JOIN compatibility_core_rules_closure_temp ccr2
ON (
ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_a OR
ccr1.part_type_field_values_id_a = ccr2.part_type_field_values_id_b
)
GROUP BY ccrA,
ccrB
HAVING -- ccrA and ccrB are in fact not the same
ccrA != ccrB
-- ccrA and ccrB do not belong to the same part type
AND (
SELECT ptf.part_type_id
FROM part_type_field_values ptfv
INNER JOIN part_type_fields ptf
ON ptfv.part_type_field_id = ptf.id
WHERE ptfv.id = ccrA
LIMIT 1
) !=
(
SELECT ptf.part_type_id
FROM part_type_field_values ptfv
INNER JOIN part_type_fields ptf
ON ptfv.part_type_field_id = ptf.id
WHERE ptfv.id = ccrB
LIMIT 1
)
Now this is where things go wrong. I get the following error:
#1146 - Table 'db509574872.compatibility_core_rules_closure' doesn't exist
I'm very confused by this error message. I literally just created the view/table only two statements ago. I'm sure the SELECT query itself is correct since, if I run it by itself, it works fine. If I change the first line to use compatibility_core_rules_closure2 instead of compatibility_core_rules_closure then it runs fine (however, that's not much use since I need to be re-updating the same view again and again). I've looked into the SQL SECURITY clauses but have not had any success, and I've been researching online without getting anywhere.
Does anyone have any ideas what is happening and how to solve it?
MySQL doesn't support sub-queries in view definitions.
You'll have to separate them, i.e. put the sub-query inside another view and reference that view from your main view.
Running the CREATE statement for a view like that raises an error and the view is not created, hence the "doesn't exist" error you get when querying it.
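A minimal sketch of that separation, reusing the tables from the question (the join reproduces what the correlated subqueries were looking up):
-- Helper view exposing the part_type_id for each part_type_field_values row
CREATE OR REPLACE VIEW part_type_lookup AS
SELECT ptfv.id AS part_type_field_values_id
     , ptf.part_type_id
FROM part_type_field_values ptfv
INNER JOIN part_type_fields ptf
    ON ptfv.part_type_field_id = ptf.id;
The main view can then join against part_type_lookup twice (once for ccrA, once for ccrB) and compare the two part_type_id values in its HAVING clause, instead of embedding the subqueries directly.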

SQL server insert multiple rows and incrementing a int column

I have some rows in a table and need to transfer them to another table. In the destination table I also need to add a field with an incremental value.
I'm doing the following, but I know that something in the INSERT is wrong, because the incremented value (intCodInterno) is always the same:
INSERT INTO Emp_VISOT.dbo.TBL_GCE_ARTIGOS
( strCodigo ,
strDescricao ,
intCodInterno ,
intCodTaxaIvaCompra ,
intCodTaxaIvaVenda ,
strCodCategoria ,
strAbrevMedStk ,
strAbrevMedVnd ,
strAbrevMedCmp ,
bitAfectaIntrastat
)(
SELECT A.Artigo ,
a.Descricao ,
IDENT_CURRENT('Emp_VISOT.dbo.TBL_GCE_ARTIGOS')+1,
'3' ,
'3' ,
'1' ,
'Un' ,
'Un' ,
'Un' ,
'0'
FROM PRIVESAM.DBO.Artigo A)
What do I need to change so the value is incremented correctly?
Thank you.
EDIT:
I made a small change to the query, and now it works.
I just wrapped the IDENT_CURRENT call in a SELECT inside brackets:
(SELECT IDENT_CURRENT('Emp_VISOT.dbo.TBL_GCE_ARTIGOS')+1)
I got all the rows that I need from the old table into the new one with the incremented value.
The expression IDENT_CURRENT('Emp_VISOT.dbo.TBL_GCE_ARTIGOS') + 1 is evaluated once when the query runs, so all the rows get the same id.
The first solution is to iterate over the SELECT result with a loop construct such as a cursor and insert the incremented index yourself (which is what you are doing).
The second solution is to make that column in the destination table an identity column.
Remove the intCodInterno part and, in SQL Server, use the IDENTITY property to automatically increment it for you.
IDENT_CURRENT won't update until the transaction commits, therefore its value remains constant until you insert.
Here are three options for fixing this issue:
Use some kind of counter (@newRowNum) such that for each row in your SELECT query, @newRowNum = @newRowNum + 1, and thus your intCodInterno number = IDENT_CURRENT() + @newRowNum. This would probably require a lot of hacking to work, though. Don't recommend it.
Insert your rows sequentially using the same business logic you have now - it will be tremendously less performant, however. Don't recommend it.
Set that column in your destination table to be an identity column itself. This is by far the best way to do it; a sketch follows below.
If you need a custom identity function (I assume there's a reason you're not using an identity column now), you can create one using some of the steps outlined above: http://www.sqlteam.com/article/custom-auto-generated-sequences-with-sql-server
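A minimal sketch of the third option above, assuming the destination table can be recreated (the column list and data types are abbreviated guesses):
-- intCodInterno is now generated automatically, so it is simply omitted from the INSERT
CREATE TABLE Emp_VISOT.dbo.TBL_GCE_ARTIGOS
(
    strCodigo varchar(50) NOT NULL,
    strDescricao varchar(255) NULL,
    intCodInterno int IDENTITY(1,1) NOT NULL
    -- ... remaining columns as before ...
);

INSERT INTO Emp_VISOT.dbo.TBL_GCE_ARTIGOS (strCodigo, strDescricao /* , ... */)
SELECT A.Artigo, A.Descricao /* , ... */
FROM PRIVESAM.DBO.Artigo AS A;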
In my case, I inserted rows sequentially using the same business logic. I could not use auto-increment because I had to import old data into this column as well. Once you have imported the data, you can then switch the column over to auto-increment.

Practical limit to length of SQL query (specifically MySQL)

Is it particularly bad to have a very, very large SQL query with lots of (potentially redundant) WHERE clauses?
For example, here's a query I've generated from my web application with everything turned off, which should be the largest possible query for this program to generate:
SELECT *
FROM 4e_magic_items
INNER JOIN 4e_magic_item_levels
ON 4e_magic_items.id = 4e_magic_item_levels.itemid
INNER JOIN 4e_monster_sources
ON 4e_magic_items.source = 4e_monster_sources.id
WHERE (itemlevel BETWEEN 1 AND 30)
AND source!=16 AND source!=2 AND source!=5
AND source!=13 AND source!=15 AND source!=3
AND source!=4 AND source!=12 AND source!=7
AND source!=14 AND source!=11 AND source!=10
AND source!=8 AND source!=1 AND source!=6
AND source!=9 AND type!='Arms' AND type!='Feet'
AND type!='Hands' AND type!='Head'
AND type!='Neck' AND type!='Orb'
AND type!='Potion' AND type!='Ring'
AND type!='Rod' AND type!='Staff'
AND type!='Symbol' AND type!='Waist'
AND type!='Wand' AND type!='Wondrous Item'
AND type!='Alchemical Item' AND type!='Elixir'
AND type!='Reagent' AND type!='Whetstone'
AND type!='Other Consumable' AND type!='Companion'
AND type!='Mount' AND (type!='Armor' OR (false ))
AND (type!='Weapon' OR (false ))
ORDER BY type ASC, itemlevel ASC, name ASC
It seems to work well enough, but it's also not particularly high traffic (a few hundred hits a day or so), and I wonder if it would be worth the effort to try and optimize the queries to remove redundancies and such.
Reading your query makes me want to play an RPG.
This is definitely not too long. As long as they are well formatted, I'd say a practical limit is about 100 lines. After that, you're better off breaking subqueries into views just to keep your eyes from crossing.
I've worked with some queries that are 1000+ lines, and that's hard to debug.
By the way, may I suggest a reformatted version? This is mostly to demonstrate the importance of formatting; I trust this will be easier to understand.
select *
from
4e_magic_items mi
,4e_magic_item_levels mil
,4e_monster_sources ms
where mi.id = mil.itemid
and mi.source = ms.id
and itemlevel between 1 and 30
and source not in(16,2,5,13,15,3,4,12,7,14,11,10,8,1,6,9)
and type not in(
'Arms' ,'Feet' ,'Hands' ,'Head' ,'Neck' ,'Orb' ,
'Potion' ,'Ring' ,'Rod' ,'Staff' ,'Symbol' ,'Waist' ,
'Wand' ,'Wondrous Item' ,'Alchemical Item' ,'Elixir' ,
'Reagent' ,'Whetstone' ,'Other Consumable' ,'Companion' ,
'Mount'
)
and ((type != 'Armor') or (false))
and ((type != 'Weapon') or (false))
order by
type asc
,itemlevel asc
,name asc
/*
Some thoughts:
==============
0 - Formatting really matters, in SQL even more than most languages.
1 - consider selecting only the columns you need, not "*"
2 - use of table aliases makes it short & clear ("MI", "MIL" in my example)
3 - joins in the WHERE clause will un-clutter your FROM clause
4 - use NOT IN for long lists
5 - logically, the last two lines can be added to the "type not in" section.
I'm not sure why you have the "or false", but I'll assume some good reason
and leave them here.
*/
Default MySQL 5.0 server limitation is "1MB", configurable up to 1GB.
This is configured via the max_allowed_packet setting on both client and server, and the effective limitation is the lesser of the two.
Caveats:
It's likely that this "packet" limitation does not map directly to characters in a SQL statement. (Surely you want to take into account character encoding within the client, some packet metadata, etc.)
SELECT @@global.max_allowed_packet;
This is the only real limit; it's adjustable on a server, so there is no single straight answer.
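If it needs raising, a quick sketch (requires administrative privileges; also set it in my.cnf so it survives a restart, and note that only new client connections pick up the change):
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;  -- 64 MB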
From a practical perspective, I generally consider any SELECT that ends up taking more than 10 lines to write (putting each clause/condition on a separate line) to be too long to easily maintain. At this point, it should probably be done as a stored procedure of some sort, or I should try to find a better way to express the same concept--possibly by creating an intermediate table to capture some relationship I seem to be frequently querying.
Your mileage may vary, and there are some exceptionally long queries that have a good reason to be. But my rule of thumb is 10 lines.
Example (mildly improper SQL):
SELECT x, y, z
FROM a, b
WHERE fiz = 1
AND foo = 2
AND a.x = b.y
AND b.z IN (SELECT q, r, s, t
FROM c, d, e
WHERE c.q = d.r
AND d.s = e.t
AND c.gar IS NOT NULL)
ORDER BY b.gonk
This is probably too large; optimizing, however, would depend largely on context.
Just remember, the longer and more complex the query, the harder it's going to be to maintain.
Most databases support stored procedures to avoid this issue. If your code is fast enough to execute and easy to read, you don't want to have to change it in order to get the compile time down.
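As a rough sketch of the stored-procedure route, using the tables from the question (the parameters and filter are only illustrative):
DELIMITER //
CREATE PROCEDURE get_magic_items(IN min_level INT, IN max_level INT)
BEGIN
    -- Same join as the original query, with the level range parameterised
    SELECT *
    FROM 4e_magic_items mi
    INNER JOIN 4e_magic_item_levels mil ON mi.id = mil.itemid
    WHERE itemlevel BETWEEN min_level AND max_level;
END //
DELIMITER ;

CALL get_magic_items(1, 30);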
An alternative is to use prepared statements, so you take the parse hit only once per client connection and then pass in only the parameters for each call.
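And a minimal sketch of the prepared-statement idea in plain MySQL syntax (most client libraries expose the same thing through their own placeholder APIs; the column names are taken from the query above):
PREPARE magic_items_stmt FROM
    'SELECT * FROM 4e_magic_items WHERE source != ? AND type != ?';
SET @excluded_source = 16, @excluded_type = 'Potion';
EXECUTE magic_items_stmt USING @excluded_source, @excluded_type;
DEALLOCATE PREPARE magic_items_stmt;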
I'm assuming that by 'turned off' you mean a field doesn't have a value?
Instead of checking that something is not this, and also not that, etc., can't you just check whether the field is null? Or set the field to 'off', and check whether type (or whatever) equals 'off'.