Update data in a table: Lookup? Merge? - SSIS

I am in need of a solution.
I am supposed to load the data of a table from the PROD server to UAT. If records are missing in UAT, I need to load the missing rows. How should I go about it?
Second problem:
I am fetching some data (EmpId, NAME, CreditCardNumber) from some text files. These rows are matched on EmpId against a table in SQL Server (ID, Address, ContactNumber).
The combined information (ID, NAME, ContactNumber, Address, CreditCard) has to be loaded into the main table. If the record doesn't exist, ADD it. But if some information is missing in the fields of an existing record, UPDATE it.
I was able to get some information from the Lookup video session that was uploaded, but I am not able to do what is required.
Please help.

To join the data from your two sources you should use a "Merge Join" component or a "Lookup" component; it depends on how many rows you have in each source. Once your two sources have been joined, write the result to a staging table, then apply a SQL MERGE statement between the staging table and the final destination table.
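A minimal sketch of that final MERGE, assuming the joined rows were landed in a staging table called dbo.Staging_Employee and the destination is dbo.MainTable keyed on ID (both names are placeholders for your own schema):

-- insert rows that are missing from the destination, update the ones that already exist
MERGE dbo.MainTable AS tgt
USING dbo.Staging_Employee AS src
   ON tgt.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET tgt.NAME          = src.NAME,
               tgt.ContactNumber = src.ContactNumber,
               tgt.Address       = src.Address,
               tgt.CreditCard    = src.CreditCard
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, NAME, ContactNumber, Address, CreditCard)
    VALUES (src.ID, src.NAME, src.ContactNumber, src.Address, src.CreditCard);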

Probably not what you are looking for, but if it is an incremental load, you can import the data into a "stage" table and write a query to do an update/insert into the active tables. Let it compare the primary keys: if a key already matches, test the fields for changes and update; if not, insert a new row.
Hope it helps.

I can't have a Staging table. That is a requirement.
Anyway, I did put together a partial solution for the problem.
We need to use two Lookup transformations to get the desired result:
One for combining the data from the flat file with the table that holds the partial data.
One for checking whether the record exists, based on the business key (i.e. ID, the primary key).
Flat File Source --> Lookup (for combining) --> Lookup (for record check) --> OLE DB Destination
The records that come out of the No Match output are inserted into the table.
I still need to find a way to update the records that come out of the Match output.
If you can provide a solution for that, it would be highly appreciated.
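One common way to handle the Match output without a staging table is to route it into an OLE DB Command transformation that runs a parameterized UPDATE. The ? markers are mapped to the pipeline columns in order, and the table and column names below are only placeholders:

-- executed once per row from the Match output; map the ? parameters
-- to NAME, ContactNumber, Address, CreditCard and finally ID
UPDATE dbo.MainTable
   SET NAME          = ?,
       ContactNumber = ?,
       Address       = ?,
       CreditCard    = ?
 WHERE ID = ?;

Note that the OLE DB Command executes row by row, so it is fine for modest volumes but slow for very large ones.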

Related

How to update multiple tables in a database (Django) using a single .csv file?

Table Image.
I want to distribute the data of the above table into multiple tables.
Say:
Product name and Company go into the 1st table.
Barcode and Price go into the 2nd table.
Category and Subcategory go into the 3rd table.
One approach to solving your problem would be to implement a custom management command; you can check the documentation here. You could parse the CSV and then update the specific entries in your database.
The usage would be something like this (assuming for example that the command is called updateproducts):
$ python manage.py updateproducts path/to/your/file.csv
Of course, depending on the size of your data, other approaches might be more efficient.

Neo4J custom load CSV

I asked a question a few days ago to find out how to import an existing database into Neo4j. Thanks to the person who explained to me how to do that. I decided to create a CSV file from my database (around 1 million entries) and to load it from the Neo4j webadmin to test it. The problem is that each row of this database contains redundant data; for example, my database contains actions from different users, but each user can perform multiple actions. The structure of my graph would be to create a node for each user that is linked to each action he performs. That's why I have to create only one node for each user even if his name appears in several rows of my CSV file (because he performed several actions). What is the method to do that? I guess it's possible to do that in Cypher, right?
Thanks a lot
Regards
Sam
In case you have references that might or might not exist, you should use the MERGE statement. MERGE either finds something or creates something in your database.
Please refer to the respective section in the reference manual: http://docs.neo4j.org/chunked/stable/cypherdoc-importing-csv-files-with-cypher.html. In that example the country is shared by multiple users, so the country is merged, whereas the users and their relationships to countries are unconditionally created.

MS SQL Server: using CDC to populate single destination table from several source tables

Can I use Change Data Capture in MS SQL Server (2008 or 2012) with an SSIS package that joins several source tables into one destination table?
Technet articles describe CDC + SSIS usage cases where the source table and the destination table have the same structure. The only hint at the possibility of change tracking for custom data transformations is that it is possible to specify the columns for which CDC will track changes.
The problem is, I need to combine data from a number of source tables to get the destination table and then keep it in sync with those source tables.
This is because the data in the destination data warehouse is normalized to a lesser extent than in the source database. For example, I have an Events table (containing Computer ID, Date/Time, and Event Description) and a Computers table (containing Computer ID and Computer Name). I don't need those normalized tables and computer IDs in the destination table, so the SELECT to fill the destination table should be:
INSERT INTO DestDB..ComputerEvents (ComputerName, DateTime, Event)
SELECT s.ComputerName, e.DateTime, e.Event
FROM SourceDB..EventLog e
JOIN SourceDB..ComputerNames s
ON e.CompID = s.CompID
I just cannot figure out how to make CDC work with an SSIS package containing such a transformation. Is it even possible?
To answer the question: No, you can't.
As one other responder has pointed out, CDC can only tell you what changed in EACH source table since the last time you extracted changes.
Using CDC to extract changes from multiple source tables to load a single destination table is anything but simple.
Let's show why by means of an example. For this example I assume that a staging table is a table that is truncated routinely before being populated.
Suppose we have two source tables, Order and OrderDetail, and one destination fact table, FactOrder. FactOrder contains the OrderKey (from Order) and the sum of the order amount from OrderDetail. A customer orders 3 products: one Order record and 3 OrderDetail records are inserted into the source database tables. Our DW ETL extracts the 1 Order record (insert) and 3 OrderDetail records (insert). If we chose to load changed records into staging tables, as a previous responder suggested, we could simply join our staging tables to create our FactOrder record.

But what happens if we no longer carry one of the products and someone deletes a record from the OrderDetail table? The next DW ETL extracts 1 OrderDetail record (delete). How do we use this information to update the target table? Clearly we can't join from Order to OrderDetail, because the Order staging table has no record for this particular OrderKey: it is a staging table that we just truncated. I chose a delete example, but consider the same problem if dependent tables are updated.
What I propose instead is to extract the distinct set of primary key values (OrderKey in our example) for which there are changes in any of the source tables required to build the FactOrder record, and then extract the full FactOrder records in a subsequent query. For example, if 5 Order records changed, we know those 5 OrderKey values. If 30 OrderDetail records changed, we determine the distinct set of OrderKey values behind them; let's say that is 10. We then union the two sets; say the overlap leaves 12 distinct OrderKey values. Now we seed our FactOrder extract query with those 12 OrderKey values and get back 12 complete FactOrder records. We then compare a binary checksum of each new record against the stored one to decide how to action the 12 records (insert or update). The approach above does not cover deletes from the Order table; those would result in trivial deletes from FactOrder.
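A rough T-SQL sketch of that key-union step, assuming CDC capture instances named dbo_Order and dbo_OrderDetail and that both change tables expose an OrderKey column (your instance names and LSN bookkeeping will differ):

-- Window of changes to pull; in practice persist the last-processed LSN
-- per capture instance rather than always starting from the minimum.
DECLARE @to_lsn          BINARY(10) = sys.fn_cdc_get_max_lsn();
DECLARE @from_lsn_order  BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Order');
DECLARE @from_lsn_detail BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_OrderDetail');

-- Distinct OrderKey values touched in either source table
SELECT o.OrderKey
FROM cdc.fn_cdc_get_all_changes_dbo_Order(@from_lsn_order, @to_lsn, 'all') AS o
UNION   -- UNION de-duplicates the combined key set
SELECT d.OrderKey
FROM cdc.fn_cdc_get_all_changes_dbo_OrderDetail(@from_lsn_detail, @to_lsn, 'all') AS d;

This key set then seeds the full FactOrder extract, and the checksum comparison decides insert versus update.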
The many examples out there, as you noted, show how to use CDC to replicate/synchronize data from one source to one destination, which isn't a typical data warehouse load use case, since tables in the data warehouse are typically denormalized (thus requiring joins among multiple source tables to build the destination row).
OK, first thing: CDC captures changes in a table, so if there was an insert, update, or delete in a table, a CDC record gets created with an indicator column saying whether it was an insert, update, or delete. All the CDC task does is output records to one of three outputs based on that indicator column. So, coming back to your question, you might have to have multiple OLE DB Sources and a CDC task for each source, UNION ALL the similar operations (inserts, updates, deletes) together, and then feed them into the destination component or an OLE DB Command component. Hope this helps :)
Consider CDC as if it were your automated mechanism for filling staging tables (instead of a sql query, or replication), using one CDC source table pointed at one regular staging table. From there simply build your joined queries against the multiple staging tables as needed.
My assumption is that you are pulling data from non-identical tables, like
an Order table,
an OrderDetail table, etc.
If you are pulling from several identical tables in the same or different dbs, then you can push the output of the CDC directly into the staging table and you're done.

Adding data to interrelated tables... easier way?

I am a bit rusty with MySQL and trying to jump in again, so sorry if this is too easy a question.
I basically created a data model that has a table called "Master" with required fields of a name and an IDcode, and then a "Details" table with a foreign key of IDcode.
Now here's where it's getting tricky. I am entering:
INSERT INTO Details (Name, UpdateDate) Values (name, updateDate)
I get an error saying IDcode on Details doesn't have a default value, so I add one; then it complains that the field 'Master_IDcode' doesn't have a default value.
It all makes sense, but I'm wondering if there's an easy way to do what I am trying to do. I want to add data into Details, and if no IDcode exists, I want to add an entry into the Master table. The problem is that I have to first add the name to the fund Master, wait for a unique ID to be generated (for IDcode), then figure that out and add it to my query when I enter the master data. As you can imagine, the queries are probably going to get quite long since I have many tables.
Is there an easier way, where every time I add something it searches by name whether a foreign key exists and, if not, adds it to all the tables it's linked to? Is there a standard way people do this? I can't imagine, with all the complex databases out there, that people have not figured out an easier way.
Sorry if this question doesn't make sense. I can add more information if needed.
P.S. This may be a different question, but I have heard of Django for Python and that it helps create queries. Would it help my situation?
Thanks so much in advance :-)
(decided to expand on the comments above and put it into an answer)
I suggest creating a set of staging tables in your database (one for each data set/file).
Then use LOAD DATA INFILE (or insert the rows in batches) into those staging tables.
Make sure you drop indexes before the load, and re-create what you need after the data is loaded.
You can then make a single pass over the staging table to create the missing master records. For example, let's say that one of your staging tables contains a country code that should be used as a masterID. You could add the master records by doing something along the lines of:
insert
into master_table(country_code)
select distinct s.country_code
from staging_table s
left join master_table m on(s.country_code = m.country_code)
where m.country_code is null;
Then you can proceed and insert the rows into the "real" tables, knowing that all detail rows reference a valid master record.
If you need to get reference information along with the data (such as translating some code), you can do this with a simple join. Also, if you want to filter rows based on some other table, this is now very easy.
-- note: `key` needs backticks because KEY is a reserved word in MySQL
insert
  into real_table_x(
       `key`
      ,colA
      ,colB
      ,colC
      ,computed_column_not_present_in_staging_table
      ,understandableCode
  )
select x.`key`
      ,x.colA
      ,x.colB
      ,x.colC
      ,(x.colA + x.colB) / x.colC
      ,c.understandableCode
  from staging_table_x x
  join code_translation c on(x.strange_code = c.strange_code);
This approach is a very efficient one and it scales very nicely. Variations of the above are commonly used in the ETL part of data warehouses to load massive amounts of data.
One caveat with MySQL is that it doesn't support hash joins, a join mechanism very well suited to fully joining two tables. MySQL uses nested loops instead, which means that you need to index the join columns very carefully.
InnoDB tables with their clustering feature on the primary key can help to make this a bit more efficient.
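For example, for the joins above you would want the probed side of each join indexed on the join column (the index names are arbitrary, and skip any column that is already a primary or unique key):

-- MySQL's nested-loop join scans one table and probes the other via an index,
-- so the probed columns need an index to avoid repeated full scans
ALTER TABLE master_table     ADD INDEX idx_master_country_code (country_code);
ALTER TABLE code_translation ADD INDEX idx_code_strange_code (strange_code);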
One last point: when you have the staging data inside the database, it is easy to add some analysis of the data and put aside "bad" rows in a separate table. You can then inspect the data using SQL instead of wading through CSV files in your editor.
I don't think there's a one-step way to do this.
What I do is issue a
INSERT IGNORE INTO master (..) VALUES (..)
to the master table, which will either create the row if it doesn't exist, or do nothing, and then issue a
SELECT id FROM master where someUniqueAttribute = ..
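With the tables from the question (Master holding Name and an auto-generated IDcode), and assuming Name has a UNIQUE index, since INSERT IGNORE only skips rows that would violate a unique constraint, that looks something like:

INSERT IGNORE INTO Master (Name) VALUES ('Some fund');   -- does nothing if the name already exists
SELECT IDcode FROM Master WHERE Name = 'Some fund';      -- key to use for the Details row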
The other option would be stored procedures/triggers, but they are still pretty new in MySQL and I doubt whether this would help performance.

SSIS Data Migration: Split Flat table into Parent + Child/Grandchild tables

I need to migrate data in a large flat table located in SQL Server 2005 into a new SQL Server 2005 schema that consists of a parent table and multiple child tables. This seems like the opposite of a merge or merge join in SSIS, but I don't understand how I would go about accomplishing this. Any recommendations are greatly appreciated. Have you ever seen any examples of how others accomplish this sort of thing?
The flat source table [FlatSource] has < 280K records and some garbage data, so I will need to handle those things at some point. But for now, here is the gist of what I need to accomplish...
The flat source table will mostly map to the new parent table [Parent]. That is to say: for each record in [FlatSource], I need to move that record into [Parent].
Once this is done, I need to record the PK of the new parent record and add numerous child records. This PK will be used when adding 0-4 records into a child table [Child1]. Basically, there may be 0-4 columns that, if populated, will each require a new record in [Child1] that uses the PK from [Parent].
Once this is done, I will need to populate 0-4 new records into [Grandchild] that will use the PK from [Child].
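To make it concrete (all the column names below are made up, not my real schema), the set-based version of the Parent/Child1 step would be something like the following; I just can't see how to express it in SSIS:

-- Capture the identity values generated for Parent, keyed by the flat table's row id
DECLARE @NewParents TABLE (ParentID INT, FlatID INT);

INSERT INTO Parent (FlatID, SomeParentCol)
OUTPUT inserted.ParentID, inserted.FlatID INTO @NewParents (ParentID, FlatID)
SELECT FlatID, SomeParentCol
FROM FlatSource;

-- One Child1 row per populated optional column, reusing the captured Parent PK
INSERT INTO Child1 (ParentID, ChildValue)
SELECT n.ParentID, f.OptionalCol1
FROM FlatSource f
JOIN @NewParents n ON n.FlatID = f.FlatID
WHERE f.OptionalCol1 IS NOT NULL;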
Thanks for any insight you can offer. I have started a project in C#, but the more I dig into it, the more it seems like a task for SSIS.
Sincerely,
Josh Blair
Golden, CO
It looks like this would have been a task for a "Conditional Split" data flow component. This would have sat after your data source, and you would have added the different split conditions within the component itself.
When connecting destinations to the Conditional Split, you can specify which "condition" is received by each destination. As you can have many conditions, you can have many destinations.