SSIS Data Migration: Split Flat table into Parent + Child/Grandchild tables

I need to migrate data from a large flat table in SQL Server 2005 into a new SQL Server 2005 schema that consists of a parent table and multiple child tables. This seems like the opposite of a merge or merge join in SSIS, but I don't understand how I would go about accomplishing it. Any recommendations are greatly appreciated. Have you ever seen examples of how others accomplish this sort of thing?
The flat source table [FlatSource] has < 280K records and some garbage data, so I will need to handle those things at some point. But for now, here is the gist of what I need to accomplish...
The flat source table will mostly map to the new parent table [Parent]. That is to say: For each record in the [FlatSource], I need to move this record into [Parent].
Once this is done, I need to record the PK of the new parent record and add numerous child records. This PK will be used when adding 0-4 records into a child table [Child1]. Basically, there are up to four columns that, if populated, will each require a new record in [Child1] that uses the PK from [Parent].
Once this is done, I will need to populate 0-4 new records into [Grandchild] that will use the PK from [Child1].
Thanks for any insight you can offer. I have started a project in C# but the more I dig into it, the more it seems like a task for SSIS.
Sincerely,
Josh Blair
Golden, CO

It looks like this would be a task for a 'Conditional Split' data flow component. It would sit after your data source, and you would add the different split conditions within the component itself.
When connecting destinations to the Conditional Split, you specify which 'condition' each destination receives. As you can have many conditions, you can have many destinations.
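Whichever route you take, the tricky part is getting the newly generated parent keys back for the child inserts. As a rough illustration only (column names such as FlatID, Name, CreatedDate, and Col1-Col4 are hypothetical), a set-based T-SQL alternative that works on SQL Server 2005 is to carry the flat table's key into [Parent] so the new identity values can be joined back, and then unpivot the up-to-four optional columns into [Child1]:

-- Sketch only; assumes Parent keeps FlatSource's key (FlatID) alongside its own identity PK.
INSERT INTO Parent (FlatID, Name, CreatedDate)
SELECT FlatID, Name, CreatedDate
FROM FlatSource;

-- One Child1 row per populated optional column, carrying the new ParentID.
INSERT INTO Child1 (ParentID, ChildValue)
SELECT p.ParentID, x.ChildValue
FROM Parent AS p
JOIN (SELECT FlatID, Col1 AS ChildValue FROM FlatSource
      UNION ALL SELECT FlatID, Col2 FROM FlatSource
      UNION ALL SELECT FlatID, Col3 FROM FlatSource
      UNION ALL SELECT FlatID, Col4 FROM FlatSource) AS x
  ON x.FlatID = p.FlatID
WHERE x.ChildValue IS NOT NULL;

-- [Grandchild] follows the same pattern, joining [Child1] on its key to pick up the Child1 PK.

In SSIS, the equivalent is one data flow per level, with a Lookup after the parent load to retrieve the generated keys before inserting the child rows.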

best approach to exchanging data dumps between organizations

I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records. After that, I don't anticipate more than a few hundred a month. However, there will also be existing records that will be updated as the student changes grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes a creation_date and a modification_date field. My thought was that the person performing the extraction could connect to the MySQL db and run a query against the imported table each month to get the differences before the next extraction.
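For example, something like this (the table name, column names, and date are just placeholders):

-- Rows added or changed since the previous extraction (MySQL).
SET @last_extraction = '2015-01-01';
SELECT *
FROM imported_students
WHERE creation_date >= @last_extraction
   OR modification_date >= @last_extraction;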
Another thought I had was to create a new received table each month for each data dump, and then query for the differences between them.
Note: the importing system is legacy and accepts imports only through a utility and its own CSV-type files, so that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction-level data. Add-only. New information is simply appended: Sammy got a 75 on this week's quiz, Beth did 5 points of extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp at which it was added.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: summary table, rebuilt at some interval. This table does a simple aggregation on the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe it's a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data-loss, you just INSERT and add a timestamp so you can have snapshots going back.
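A rough sketch of the two tables and the nightly rebuild (MySQL, since you mentioned it; every name here is a placeholder):

-- Table 1: append-only transaction log; rows are only ever added.
CREATE TABLE grade_transactions (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    student_id  INT           NOT NULL,
    course_id   INT           NOT NULL,
    points      DECIMAL(6,2)  NOT NULL,
    max_points  DECIMAL(6,2)  NULL,
    added_at    TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Table 2: summary, one row per student per course, rebuilt nightly (lazy wipe-and-reload).
CREATE TABLE grade_summary (
    student_id  INT           NOT NULL,
    course_id   INT           NOT NULL,
    total       DECIMAL(10,2) NOT NULL,
    built_at    TIMESTAMP     NOT NULL,
    PRIMARY KEY (student_id, course_id)
);

TRUNCATE TABLE grade_summary;
INSERT INTO grade_summary (student_id, course_id, total, built_at)
SELECT student_id, course_id, SUM(points), NOW()
FROM grade_transactions
GROUP BY student_id, course_id;

If you want snapshots going back, drop the TRUNCATE and the primary key, keep built_at, and just keep inserting.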

Neo4J custom load CSV

I asked a question a few days ago about how to import an existing database into Neo4j. Thanks to the person who explained to me how to do that. I decided to create a CSV file from my database (around 1 million entries) and to load it from the Neo4j webadmin to test it. The problem is that each row of this database contains redundant data; for example, my database contains actions from different users, but each user can do multiple actions. The structure of my graph would be to create a node for each user that is linked to each action he does. That's why I have to create only one node for each user even if his name appears in several rows of my CSV file (because he made several actions). What is the method to do that? I guess it's possible to do that in Cypher, right?
Thanks a lot
Regards
Sam
In case you have entities that might or might not already exist, you should use the MERGE clause. MERGE either finds something or creates something in your database.
Please refer to the respective section in the reference manual: http://docs.neo4j.org/chunked/stable/cypherdoc-importing-csv-files-with-cypher.html. In that example the country is shared by multiple users, so the country is merged, whereas the users and their relationships to countries are unconditionally created.
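For your user/action case, a minimal sketch (the CSV header names user and action, the file path, and the labels are just placeholders):

// One User node per distinct name, no matter how many rows mention it; one Action per row.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///actions.csv' AS row
MERGE (u:User {name: row.user})
CREATE (a:Action {description: row.action})
CREATE (u)-[:PERFORMED]->(a);

With a million rows, create an index or uniqueness constraint on :User(name) first so the MERGE lookups stay fast.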

MS SQL Server: using CDC to populate single destination table from several source tables

Can I use Change Data Capture in MS SQL Server (2008 or 2012) with the SSIS Package which joins several source tables into one destination table?
Technet articles describe CDC + SSIS use cases where the source table and the destination table have the same structure. The only hint at the possibility of change tracking for custom data transformations is that it is possible to specify the columns for which CDC will track changes.
The problem is, I need to combine data from a number of source tables to get the destination table and then keep it in sync with those source tables.
This is because the data in the destination data warehouse is normalized to a lesser extent than in the source database. For example, I have an Events table (containing Computer ID, Date/Time, and Event Description) and a Computers table (containing Computer ID and Computer Name). I don't need those normalized tables and computer IDs in the destination, so the SELECT to fill the destination table should be:
INSERT INTO DestDB..ComputerEvents (ComputerName, DateTime, Event)
SELECT s.ComputerName, e.DateTime, e.Event
FROM SourceDB..EventLog e
JOIN SourceDB..ComputerNames s
ON e.CompID = s.CompID
I just cannot figure out how to make CDC work with an SSIS package containing such a transformation. Is it even possible?
To answer the question: No, you can't.
As one other responder has pointed out, CDC can only tell you what changed in EACH source table since the last time you extracted changes.
Using CDC to extract changes from multiple source tables to load a single destination table is anything but simple.
Let's show why by means of an example. For this example I assume that a staging table is a table that is truncated routinely before being populated.
Suppose we have two source tables, Order and OrderDetail, and one destination fact table, FactOrder. FactOrder contains the OrderKey (from Order) and the sum of the order amount from OrderDetail. A customer orders 3 products. One Order record and 3 OrderDetail records are inserted into the source database tables. Our DW ETL extracts the 1 Order record (insert) and 3 OrderDetail records (insert). If we chose to load changed records into staging tables, as a previous responder suggested, we could simply join our staging tables to create our FactOrder record. But what happens if we no longer carry one of the products and someone deletes a record from OrderDetail? The next DW ETL extracts 1 OrderDetail record (delete). How do we use this information to update the target table? Clearly we can't join from Order to OrderDetail, because the Order staging table has no record for this particular OrderKey; we just truncated it. I chose a delete example, but consider the same problem if dependent tables are updated.
What I propose instead is to extract the distinct set of primary key values (OrderKey in our example) for which there are changes in any of the source tables required to build the FactOrder record, and then extract the full FactOrder records in a subsequent query. For example, if 5 Order records changed, we know those 5 OrderKey values. If 30 OrderDetail records changed, we need to determine the distinct set of OrderKey values; let's say that is 10 OrderKeys. We then union the two sets; let's say there is overlap, so that yields 12 distinct OrderKey values. Now we seed our FactOrder extract query with those 12 OrderKey values and get back 12 complete FactOrder records. We then compare the new binary checksum to the stored one to determine how to action the 12 records (insert or update). The above approach does not cover deletes from the Order table; those would result in trivial deletes from FactOrder.
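A rough T-SQL sketch of that seeding step (the capture-instance names dbo_Order and dbo_OrderDetail, the column names, and the Amount column are hypothetical; net-changes functions require @supports_net_changes = 1, and a real load would persist the last-processed LSN instead of calling the min-LSN function every run):

-- 1. Distinct OrderKeys touched in either source table since the last extraction.
DECLARE @to          binary(10) = sys.fn_cdc_get_max_lsn();
DECLARE @fromOrder   binary(10) = sys.fn_cdc_get_min_lsn('dbo_Order');
DECLARE @fromDetail  binary(10) = sys.fn_cdc_get_min_lsn('dbo_OrderDetail');

SELECT DISTINCT OrderKey
INTO #ChangedOrders
FROM (
    SELECT OrderKey FROM cdc.fn_cdc_get_net_changes_dbo_Order(@fromOrder, @to, 'all')
    UNION
    SELECT OrderKey FROM cdc.fn_cdc_get_net_changes_dbo_OrderDetail(@fromDetail, @to, 'all')
) AS changed;

-- 2. Re-extract the complete fact rows for just those keys; compare checksums to the
--    stored rows to decide insert vs. update.
SELECT o.OrderKey,
       SUM(d.Amount)                              AS OrderAmount,
       BINARY_CHECKSUM(o.OrderKey, SUM(d.Amount)) AS RowChecksum
FROM SourceDB..[Order] AS o
JOIN SourceDB..OrderDetail AS d ON d.OrderKey = o.OrderKey
WHERE o.OrderKey IN (SELECT OrderKey FROM #ChangedOrders)
GROUP BY o.OrderKey;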
The many examples out there, as you noted, show how to use CDC to replicate/synchronize data from one source to one destination, which isn't a typical data warehouse load use case, since the tables in the data warehouse are typically denormalized (thus requiring joins among multiple source tables to build the destination row).
OK, first things first: CDC captures changes in a table. If there was an insert, update, or delete in a table, then a CDC record gets created with an indicator column saying insert, update, or delete, and all the CDC task does is output records to one of three outputs based on that indicator column. So, coming back to your question, you might have to have multiple OLE DB Sources with a CDC task for each source, UNION ALL the similar operations (insert, update, delete) together, and then send them to the Destination component or an OLE DB Command component. Hope this helps :)
Consider CDC as if it were your automated mechanism for filling staging tables (instead of a sql query, or replication), using one CDC source table pointed at one regular staging table. From there simply build your joined queries against the multiple staging tables as needed.
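For example (the capture instance dbo_EventLog, the Staging database, and the Operation column are hypothetical; in practice you would persist the last-processed LSN rather than using the minimum each time), each incremental load could do something like:

-- Land the net changes for one source table in its own staging table; the multi-table
-- join into DestDB..ComputerEvents then runs against the staging tables.
DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn('dbo_EventLog');
DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();

TRUNCATE TABLE Staging..EventLog_Changes;

INSERT INTO Staging..EventLog_Changes (CompID, [DateTime], [Event], Operation)
SELECT CompID, [DateTime], [Event], __$operation    -- 1 = delete, 2 = insert, 4 = update
FROM cdc.fn_cdc_get_net_changes_dbo_EventLog(@from, @to, 'all');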
My assumption is that you are pulling data from non-identical tables, like
an Order table,
an OrderDetail table, etc.
If you are pulling from several identical tables in the same or different dbs, then you can push the output of the CDC directly into the staging table and you're done.

Update Data in table. Lookup? Merge?

I am in need of a solution.
I am supposed to load the data of a table from the PROD server to UAT. If records are missing in UAT, load the missing rows. How should I go about it?
Second Problem.
I am fetching some data (EmpId, NAME, CreditCardNumber) from some text files. It is combined, based on EmpId, with data from a table in SQL Server (ID, Address, ContactNumber).
The combined information (ID, NAME, ContactNumber, Address, CreditCard) has to be loaded into the main table. If the record doesn't exist, ADD it. But if some information is missing in the fields of an existing record, UPDATE it.
I was able to get some information from the Lookup video session that was uploaded.
But I am not able to do the required things.
Please help.
To join the data from your two sources, you should use a Merge Join component or a Lookup component; it depends on how many rows you have in each source. Once your two sources have been joined, you should write the result to a staging table. Then apply a SQL MERGE statement between the staging table and the final destination table.
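The final step could look something like this (T-SQL; the table names are placeholders, and the columns follow the ones you listed):

-- Upsert from the staging table into the final destination.
MERGE INTO dbo.EmployeeMain AS dest
USING stg.EmployeeCombined AS src
    ON dest.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET dest.NAME          = src.NAME,
               dest.ContactNumber = src.ContactNumber,
               dest.Address       = src.Address,
               dest.CreditCard    = src.CreditCard
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, NAME, ContactNumber, Address, CreditCard)
    VALUES (src.ID, src.NAME, src.ContactNumber, src.Address, src.CreditCard);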
Probably not what you are looking for, but if it is incremental loads, you can import the data into a "Stage" table and write a query to do an update/insert into the active tables. Let it compare the primary keys: if they match, test the fields for changes and update; if not, insert a new row.
Hope it helps.
I can't have a Staging table. That is a requirement.
Anyway, I did come up with a partial solution to the problem.
We need to use 2 Lookup transformations to get the desired result:
1 for combining the data from the flat file and the table that holds the partial data.
1 for checking for record existence based on the business key (i.e. ID (primary key)).
Flat File Source --> Lookup (for combining) --> Lookup (for record check) --> OLE DB Destination
The records that come out of the No Match Output are inserted into the table.
I still need to find a way to update the records that come out of the Match Output.
If you guys can provide me a solution for it, it will be highly appreciated.
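One common pattern for the Match Output, given the no-staging-table constraint, is an OLE DB Command transformation running a parameterized UPDATE along these lines (MainTable is a placeholder name; the ? markers are mapped to the Match Output columns inside the component):

-- Parameterized UPDATE used inside an OLE DB Command (parameters are mapped by position).
UPDATE MainTable
SET    NAME          = ?,
       ContactNumber = ?,
       Address       = ?,
       CreditCard    = ?
WHERE  ID = ?;

Note that the OLE DB Command executes once per row, which is fine for modest volumes but slow for large ones.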

Merging tables from two Access databases into one new common database

I have this assignment that I think someone should be able to help me with. I have 5 Access databases: wvrapnoah.accdb, wvrappaul.accdb, etc. These databases have about 45 tables each and 15 forms. The good part is that the structure, the names, and the fields of each table are the same in all the databases; only the data (the records) is different. For example, I have a stress table in wvrapnoah as well as in wvrappaul, with the same fields in both tables but different records.
So, I need to merge all five into a new Access database that will have the same structure as the 5 databases but will include the complete data, that is, all the records from the 5 databases merged into this new database. The same applies even to the 15 forms. There does not seem to be a primary key, I guess. I was planning to add a field to each table that would also give me the name of the database it was merged from. For example, I will add a DBName field to all the tables in wvrapnoah and put the name Noah in that field for all the records in each table. I basically need to automate this.
I need a script (VBA or anything) so that the guys creating these databases can just run this script the next time and merge the databases.
Talking about the 'table' part of the problem:
Questions
Are the database/table names fixed, or do you not know them in advance?
Are you able to use linked tables?
I believe the most straightforward way to merge all of them is to link all the tables into a single Access DB and then run a UNION ALL query. It would be something like this:
SELECT "HANK", *
FROM MyTableHank
UNION ALL
SELECT "JOHN", *
FROM MyTableJohn;
Notice I defined a field to identify the origin of the data being merged ("HANK", "JOHN"), as you suggested above.
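To actually materialize the merge into the new common database (rather than just view it), the same idea works as append queries run one per source database; here MergedTable, DBName, and the field names are placeholders, and Access runs one statement at a time:

INSERT INTO MergedTable (DBName, Field1, Field2)
SELECT "HANK", Field1, Field2 FROM MyTableHank;

INSERT INTO MergedTable (DBName, Field1, Field2)
SELECT "JOHN", Field1, Field2 FROM MyTableJohn;

In VBA you could loop over the linked tables and run each statement with CurrentDb.Execute to automate it.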
About the forms: I believe you'll need to import them and then review all the code. It basically depends on what the forms are doing. If they're query-based, it won't be a big deal (importing and fixing the queries will make the forms work). However, if the forms are tied directly to the tables, you'll have more work to do.