update target table given DateCreated and DateUpdated columns in source table - SSIS

What is the most efficient way of updating a target table given the fact that the source table contains a DateTimeCreated and DateTimeUpdated column?

I would like to keep the source and target in sync while avoiding a
truncate. I am looking for a best-practice pattern in this situation.
I'll avoid a "best practice" answer, but give enough detail for you to make an appropriate choice. There are two main methods by which you might update a table in SSIS while avoiding a TRUNCATE and re-LOAD:
1) Use an OLE DB Command
This method is good if:
you have a reliable DateTimeUpdated column,
there are not many rows to update,
there are not many columns to update,
there are not many columns added in the data flow (e.g., by Derived Column transforms),
and the update statement is fairly straightforward.
This method performs poorly with many columns because it performs a row-by-row update. Relying on an audit date column can be a great way to reduce the number of rows to update, but it can also cause problems if rows are updated in the source system and the audit column is not changed. I recommend only trusting it if it is maintained by a trigger, or if you can be certain that no human can perform updates on the table.
Additionally, this component falls short when there are a lot of columns to map or a lot of transforms going on in the data flow. For example, if you are converting all string columns from Unicode to non-Unicode, you may have many additional columns in the mix that will make mapping and maintenance a pain. The mapping tool in this component is good for about 10 columns; it starts to get confusing very quickly after that, especially because you are mapping to numbered parameters rather than column names.
Lastly, if you are doing anything complex in the update statement, it is better suited to SQL code than to the component's editor, which has no IntelliSense and is generally painful to use.
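For illustration, the statement you put in an OLE DB Command looks something like this (the table and column names here are hypothetical); each ? placeholder is mapped by position to a numbered parameter in the editor:
-- Row-by-row update as issued by an OLE DB Command.
-- Parameters map by position (Param_0, Param_1, ...), not by name.
UPDATE dbo.TargetTable
SET Col1 = ?,             -- Param_0
    Col2 = ?,             -- Param_1
    DateTimeUpdated = ?   -- Param_2
WHERE Id = ?;             -- Param_3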
2) Stage the data and perform the update in an Execute SQL Task after the data flow
This method is good for all the reasons that the OLE DB Command is bad, but it has some disadvantages as well. There is more code to maintain:
a couple of T-SQL tasks,
a proc,
and a staging table.
This also means it takes more time to set up. However, it performs very well, and the code is far easier to read and understand. Ongoing maintenance is simpler as well.
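A minimal sketch of the set-based update that would run in the Execute SQL Task, assuming hypothetical dbo.TargetTable and dbo.StagingTable names keyed on Id:
-- Update only the rows whose audit column says they changed.
UPDATE t
SET t.Col1 = s.Col1,
    t.Col2 = s.Col2,
    t.DateTimeUpdated = s.DateTimeUpdated
FROM dbo.TargetTable AS t
INNER JOIN dbo.StagingTable AS s
    ON s.Id = t.Id
WHERE s.DateTimeUpdated > t.DateTimeUpdated;
Because this runs once as a set-based statement rather than once per row, it avoids the OLE DB Command's row-by-row penalty.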
Please see my notes from this other question that I happened to answer today on the same subject: SSIS Compare tables content and update another

Related

SSIS Data Flow: duplicated rule problem after lookup

I have a data flow where I need to get a column value from 'SQL tableA' and do a lookup task in 'SQL tableB' using that column value. If the lookup finds a match between the two tables, I need to get the value of another column from 'SQL tableA' and put that value in 'SQL tableC' (the table that will be persisted). If the lookup fails, the column value will be NULL.
My problem: after the behavior above, the rest of my flow is the same. So I have two identical, duplicated flows below the lookup, and this is terrible for readability and maintenance.
What can I do to resolve this situation with as little performance loss as possible?
The data model is legacy, so changing the data model is impossible.
Best Regards,
Luis
The way I see it, there are really three options:
Use UNION ALL and possibly sacrifice performance for modularity. There may in fact be no performance issue; you should test and see.
If possible, implement all of this in a stored procedure (a sketch follows below). You can implement code reuse there, and it will quite possibly run much faster.
Build a custom transformation component that implements those last three steps.
This option appeals to all programmers but may have the worst performance, and in my opinion it will just cause issues down the track. If you're writing reams of C# code inside SSIS, then you'll eventually reach a point where it's easier to just build a standalone app.
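For option 2, a minimal sketch of the lookup logic as set-based SQL inside a proc, assuming hypothetical table and column names; the LEFT JOIN reproduces the lookup, yielding NULL when no match is found:
-- Persist a value from tableA into tableC only when the lookup in tableB matches.
INSERT INTO dbo.tableC (persisted_value)
SELECT CASE WHEN b.lookup_col IS NOT NULL THEN a.other_col END
FROM dbo.tableA AS a
LEFT JOIN dbo.tableB AS b
    ON b.lookup_col = a.lookup_col;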
It would be much easier to answer if you explained:
What you're really doing:
slowly changing dimension?
data cleansing?
adding reference data?
spamming?
What those three activities are:
sending an email?
calling a web service?
calling some other API?
What your constraints are:
Is all of this data on one server, and can you create stored procs and tables?

MySQL DB Normalization & Query Indexes

We currently have a table that contains 90 columns, and as the table grows and the business needs change, we're having to alter the table a lot (add/remove columns & indexes).
Table name: quotes

| Column | Type | Null | Default |
|--------|------|------|---------|
| **id** (PK) | int(11) | No | |
| ... | | | |
| completed_at | datetime | Yes | NULL |
| reviewed_at | datetime | Yes | NULL |
| marked_dud_at | datetime | Yes | NULL |
| closed_at | datetime | Yes | NULL |
| subscribed_at | datetime | Yes | NULL |
| admin_checked_at | datetime | Yes | NULL |
| priced_at | datetime | Yes | NULL |
| number_verified_at | datetime | Yes | NULL |
| created_at | datetime | Yes | NULL |
| deleted_at | datetime | Yes | NULL |
For the application, our staff are constantly querying all sorts of variations on the above data, for example quotes that have been completed (completed_at), checked (admin_checked_at), reviewed (reviewed_at), and not deleted (deleted_at).
We're thinking it may be easier to offload some of these columns into rows in their own table, which we'll call quotes_actions, and then do some joining when querying.
Table name: quotes_actions

| Column | Type | Null | Default |
|--------|------|------|---------|
| **id** (PK) | int(11) | No | |
| quote_id | int(11) | No | |
| action | varchar(100) | No | |
| user_id | int(11) | No | |
| time | datetime | Yes | NULL |
| created_at | datetime | Yes | NULL |
An example would be a row with action = 'completed', with an index covering quote_id and action.
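For instance, that covering index might be created like this (the index name is hypothetical):
alter table quotes_actions add index idx_quote_action (quote_id, action);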
We've split the data into this format on 150,000 rows, and it's neither faster nor slower than querying the original table with the correct indexes.
Has anyone got any experience with this, and any recommendations or pitfalls for each approach? It's taking a lot of time to add covering indexes and columns to the original table as we need them, whereas the second approach has the indexes set up ready to go, but it introduces a lot more joins and more complicated queries.
0.09s:
select * from `quotes`
where `completed_at` is not null
and `approved_at` is not null
and `deleted_at` is null
versus 0.0005s:
select * from `quotes_new`
inner join quotes_actions as q1 on q1.action = 'completed' and q1.quote_id = quotes_new.id
inner join quotes_actions as q2 on q2.action = 'approved' and q2.quote_id = quotes_new.id
where quotes_new.deleted_at is null
In addition, if the 2nd approach is better, how do you query for negative results, where a quote hasn't been approved?
Database design will vary from application to application, and things that are great for one implementation will be terrible for another. You've identified a few things that are important to you:
speed of data access (at least no reduction in current performance)
ability to respond to application needs/changes
limiting complexity of queries
Without being able to see the entirety of your database and how you are using it, these are the principles I would follow:
Use Stored Procedures and Views for as much as possible
This is just good design. You create an adapter layer between your application and the data tables, which allows you to make whatever changes you need to in the database (and the views/stored procs) without having to change the application itself. Decoupling your systems makes maintenance significantly easier. Also this is good for security, as if the only way outsiders can access the data is through your stored procs, you've eliminated a few avenues of attack. (There's also debate about whether or not the DBMS will cache execution plans for stored procedures, making them execute faster than similar queries, but I'm not a DBA or DBDev, so I'm not touching that).
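As a minimal sketch of that adapter layer (the view and column selection here are hypothetical), the application reads from a view rather than the table, so splitting the table later only requires changing the view definition:
create view v_active_quotes as
select id, completed_at, reviewed_at, admin_checked_at
from quotes
where deleted_at is null;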
Attempt to limit width of tables
One thing I've seen time and time again is that every time a need arises in a production system, a column gets added to a table and they call it a day. That is far easier than rewriting a bunch of queries or reviewing table structures, but it is terrible design. If you've already limited the changes needed in the application layer by following my first piece of advice, you've limited the work needed to resolve table changes the right way. You should always evaluate whether data belongs to the row in question, or whether it should be offloaded into its own table. You shouldn't be afraid to radically alter your database when it is necessary.
Looking at the data you've provided, I think your second option is okay. You've identified many columns that actually represent the same thing (the "status changes" or as you put it "quote actions" that occur) and offloaded that from the main table to a secondary table. This is perfectly fine, and likely will be effective. You can further "cheat" to make this table faster by offloading status onto its own table, and using an integer to represent it instead of a string (since the string doesn't matter to the database, and integers are far faster to index and search).
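That "cheat" might look like this (a sketch with hypothetical names); the varchar action column is replaced by an integer foreign key into a small lookup table:
create table quote_action_types (
  id int(11) not null auto_increment primary key,
  name varchar(100) not null
);

alter table quotes_actions add column action_type_id int(11) not null;
-- index the integer key instead of the string
alter table quotes_actions add index idx_quote_action_type (quote_id, action_type_id);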
This is not to say a wide table is a bad thing; sometimes tables just need to be wide. You just need to evaluate whether the data really belongs to the entity the data row represents.
Approach queries in new ways
You will want to play with the execution plan tools of your DBMS and understand how each query really works. Changing the order of joins can drastically alter the query return speed, and you shouldn't be afraid to use table variables and temp tables in your queries. They are all tools at your disposal.
Querying for Negative Results
Since you asked this question specifically, I'll address it. It requires thinking about your query in a slightly different way (incidentally, if you haven't, you should look into taking a course or working through a textbook on relational algebra; it makes understanding databases so much easier).
Your original query made finding quotes that were not approved easy. It was all in the table: approved_at is null. Simple, easy peasy, no problems. Now, however, instead of being a column on the main table, approval is a row in its own table, which also holds all the other actions that could be taken. You need to break the problem down a little.
You want to find the set of all quotes for which there is no action signifying approval. In SQL that looks like:
select id from quotes_new
where id not in
    (select quote_id from quotes_actions where action = 'approved');
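An equivalent anti-join form, which MySQL often optimizes better than NOT IN:
select q.id
from quotes_new as q
left join quotes_actions as a
    on a.quote_id = q.id and a.action = 'approved'
where a.quote_id is null;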
Final Thoughts
You need to sit down with your team and talk about how you want to move forward with this product. Spend a few days or a couple of weeks really thinking deeply about it. Brainstorm, hackathon, do something to find a solution you like that makes your product better and more maintainable. We've all been in the situation where we have an unmaintainable product that could have been fixed at some point, but is now beyond that point. Try not to get there; fix it while you have the opportunity.

MySQL database activity log: fields vs table

So basically, I am in the process of creating a personal finance tracking system. It occurred to me that keeping tabs on when each instance and transaction was last edited or updated might be relevant information some day.
Now as far as I can see there are two approaches to implement something like this:
Create "updated" fields to all the tables I want to keep track of and then let mysql update those fields for me (ON UPDATE clause)
Create a completely seperate table for holding the log data and then update that with a triggers and transactions
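For reference, the first approach can lean on MySQL's automatic timestamp maintenance. A minimal sketch, assuming MySQL 5.6+ and a hypothetical transactions table:
create table transactions (
  id int not null auto_increment primary key,
  amount decimal(10,2) not null,
  created_at timestamp not null default current_timestamp,
  updated_at timestamp not null default current_timestamp
             on update current_timestamp  -- MySQL updates this on every write
);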
Now it seems the first approach would have the benefit of keeping things simple and easy to maintain. However, how would this impact performance if I suddenly decided to retrieve every log in the database for review? It also somewhat goes against normalization (not by much, though), with the same kind of data stored in multiple tables.
The second approach would allow more flexibility in the logging system and might actually shorten the SQL queries needed to retrieve certain data. However, it would make the schema more complex, as two additional tables would have to be created (the actual log table and a many-to-many relation table for holding the keys) and maintained. On the other hand, if I ever want to implement an activity history, this approach would probably be the only one capable of supporting it.
As such, I would like to know some more pros and cons of each method. Since the second option allows more flexibility, I am considering implementing it, but I am not sure about performance issues. In the end it comes down to two questions:
Are there any real-life examples where both approaches are implemented?
And:
Are there any studies, comparisons, or other resources that might shed some light on which is considered the more performance-friendly and "best practices" approach?
It depends on what kind of reporting you need and your current architecture.
If you just want to know the last update date, then having two fields (creation date and last update) should be enough. That's because a separate table won't give any performance boost, but will make your code harder to maintain.
It's another story if you want something more elaborate, like reporting differences (what was changed) and/or keeping a full change log for each transaction (there might be a few updates to one transaction, right?). In that case you really must have a separate table, because otherwise the logging will bloat your main table and reduce performance.
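A minimal sketch of the separate-table-plus-trigger approach (table and column names are hypothetical):
create table activity_log (
  id int not null auto_increment primary key,
  table_name varchar(64) not null,
  row_id int not null,
  changed_at timestamp not null default current_timestamp
);

-- One trigger per audited table; fires after every update.
create trigger transactions_au after update on transactions
for each row
  insert into activity_log (table_name, row_id)
  values ('transactions', new.id);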
Based on my experience, I'd go with a separate table. It will be easier to maintain - your logging logic will be practically separated from everything else - and I think one day you'll want that additional info on your transactions and a full transaction history.
As far as performance goes, you won't notice any meaningful difference unless your system is under serious load, and since your system is personal, either choice would suffice; just don't forget about proper indexing.
Note that I'm making a lot of assumptions here, so if you want something more specific, please describe your actual architecture and reporting needs. I'd suggest some books on high availability/performance, but they cover general availability/performance rather than your specific needs.

CREATE TABLE auto append default columns?

In MySQL, is it possible to append default columns after creation or create them automatically? A brief overview is this:
All tables must have 5 fields that are standardized across our databases (created_on, created_by, row_status, etc.). It's sometimes hard for developers to remember to do this, and/or it's not done uniformly. Going forward we'd like to automate the task somehow. Does anyone know if it's possible to create some sort of internal MySQL script that will automatically append a set of columns to a table?
After reading through some responses, I think I'd rephrase the question a bit: rather than making it an automatic task (i.e., on EVERY table), make it a function that can be user-triggered to go through and check for said columns and, if they're missing, add them. I'm pretty confident this is out of SQL's scope and would require a scripting language; not a huge issue, but it would have been preferable to keep things encapsulated within SQL.
I'm not very familiar with MySQL-specific data modeling tools, but there's no infrastructure to add columns to every table ever created in a database. Making this an automatic behavior would get messy too, when you think about situations where someone added the columns but with typos, or where you have tables that are allowed to go against business practice (the columns you listed would typically be worthless on code tables).
Development environments are difficult to control, but the best means of controlling this is to delegate the responsibility and permissions to as few people as possible, i.e., there may be 5 developers, but only one of them can apply scripts to TEST/PROD/etc., so it's their responsibility to review the table scripts for correctness.
I would say first: don't do that.
Make an audit table separately and link it with triggers.
Otherwise, you will need to feed your table construction through a procedure or other application that will create what you want.
I'd first defer to Randy's answer - this info is probably better extracted elsewhere.
That said, if you're set on adding the columns, ALTER TABLE is probably what you're looking for. You might also consider including some extra logic to determine which columns are missing from each table; a sketch follows.
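For illustration, the user-triggered check could query information_schema and then emit the needed ALTER statements. A sketch, assuming a hypothetical created_on column as the marker:
-- Find base tables in the current schema that are missing created_on.
select t.table_name
from information_schema.tables as t
left join information_schema.columns as c
    on c.table_schema = t.table_schema
   and c.table_name = t.table_name
   and c.column_name = 'created_on'
where t.table_schema = database()
  and t.table_type = 'BASE TABLE'
  and c.column_name is null;

-- For each table returned, run something like:
-- alter table <table_name> add column created_on datetime null;
Generating and executing those ALTER statements is the part that typically needs a scripting language, or a stored procedure using prepared statements.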

Pros and Cons for CreatedDate and ModifiedDate columns in all database tables

What are the pros and cons? When should we have them and when we shouldn't?
UPDATE
What is this comment in an update SP auto-generated with RepositoryFactory? Does it have anything to do with the above columns not being present?
--The [dbo].[TableName] table doesn't have a timestamp column. Optimistic concurrency logic cannot be generated
If you don't need historical information about your data, adding these columns will fill space unnecessarily and cause fewer records to fit on a page.
If you do or might need historical information then this might not be enough for your needs anyway. You might want to consider using a different system such as ValidFrom and ValidTo, and never modify or delete the data in any row, just mark it as no longer valid and create a new row.
See Wikipedia for more information on different schemes for keeping historic information about your data. The method you proposed is similar to Type 3 on that page and suffers from the same drawback that only information about the last change is recorded. I suggest you read some of the other methods too.
All I can say is that these fields (or full-blown audit tables) have helped me find what or who caused a major data problem. All it takes is one such use to convince you that it is worth the extra time to keep these fields up to date.
I don't usually do it for tables that are only populated through a single automated process and no one else has write permissions to the table. And usually it isn't needed for lookup tables which users generally can't update either.
There are pretty much no cons to having them, so if there is any chance you will need them, add them.
People may mention performance or storage concerns, but:
in reality they will have little to no effect on SELECT performance with modern hardware and properly specified SELECT clauses;
there can be a minor impact on write performance, but this will likely only be a concern in OLTP-type systems, and that is exactly the case where you usually want these kinds of columns;
if adding columns like this is a deal-breaker in terms of performance, then you are likely looking at moving away from SQL databases as a storage platform.
With CreatedDate, I almost always set it up with a default value of GetDate(), so I never have to think about it. When building out my schema, I will add both of these columns unless it is a lookup table with no GUI for administering it, because I know it is unlikely the data will be kept up to date if modified manually.
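That default looks like this in T-SQL (the table and constraint names are hypothetical):
-- CreatedDate fills itself in on INSERT; ModifiedDate is set by update code or a trigger.
CREATE TABLE dbo.Example (
    Id INT IDENTITY(1,1) PRIMARY KEY,
    CreatedDate DATETIME NOT NULL CONSTRAINT DF_Example_CreatedDate DEFAULT (GETDATE()),
    ModifiedDate DATETIME NULL
);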
Some DBMSs provide other means to capture this information automatically, for example Oracle Flashback or Microsoft Change Tracking / Change Data Capture. Those methods also capture more detail than just the latest modification date.
The column type timestamp is misleading: it has nothing to do with time; it is rowversion. It is widely used for optimistic concurrency; example here.