Data Validation and Reconciliation in SSIS

I have to migrate data from a non-SQL Server database to a SQL Server database using SSIS.
The data contains millions of rows.
I want to make sure that the data in the source and the data in the destination remain the same.
One of the answers I followed suggests using staging tables.
In addition to that technique, what would be the best approach for doing this?
Any thoughts or suggestions would be appreciated.
Thanks

In the data warehouse world, the staging area is the place where you simply copy the data from the source, for several reasons:
To run only a bulk copy against the production server, and so avoid consuming too many resources on it.
To keep the data unmodified during your calculations.
To apply filters and other aggregations that prepare the queries which fill the DWH.
In your case, a staging area is a good way to make the first step from the non-SQL source to a relational database.
Moreover, since staging is just a copy, you won't alter the integrity of the data during this step.
Because of this, you can run some "integrity tests" after your migration: run counts on the staging table and on your final structure, or sum the data and compare the overall results to identify differences.
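For illustration, a minimal T-SQL sketch of such a check; the names dbo.StagingOrders, dbo.Orders and Amount are hypothetical stand-ins for your own staging table, final table and a numeric column worth totalling:

```sql
-- Compare row counts and a simple numeric total between staging and destination.
-- The two result rows should match; any difference points at rows lost or altered
-- during the load. Table and column names are hypothetical.
SELECT 'staging' AS side,
       COUNT_BIG(*) AS row_count,
       SUM(CAST(Amount AS DECIMAL(18, 2))) AS amount_total
FROM dbo.StagingOrders
UNION ALL
SELECT 'destination',
       COUNT_BIG(*),
       SUM(CAST(Amount AS DECIMAL(18, 2)))
FROM dbo.Orders;
```

The same idea extends to per-day or per-key comparisons (GROUP BY a date or business key) when the totals do not line up and you need to narrow down where the difference is.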

Related

django copy selected data from one table to another table

Can anybody help me? I want to know if there is a good solution for moving a large amount of filtered data from a table in an Oracle DB to another table in a MySQL DB.
I know that you can run a query, loop over its results, and insert them into the other database, but the problem is that this may run out of memory. I'm looking for a good solution such as running jobs or some asynchronous tasks.
You are looking for an ETL system (maybe without the T in your case).
A while ago I used pygrametl and it handled the data surprisingly fast.
Another option is Django's bulk_create (don't try to insert all the data at once; split the data into chunks).

Transfer Data from two different servers in ssis

My requirement is to load data on a daily basis from a source table on one server to a destination table on another server. The servers are different, i.e. one is SQL Server and the other is Oracle.
What I want is to make the source query fast. When I execute the package today, I should get only the new records instead of all the records from the source. Reading the whole table takes a long time, and even using a Lookup transformation to check whether a record already exists takes a long time.
Please look into this.
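A common way to limit the extract to new rows is to keep a high-water mark from the previous load and filter the source query on it. This is only a sketch under assumptions: it supposes the source table has a reliable ModifiedDate (or ascending key) column, the control table dbo.EtlWatermark and all names below are hypothetical, and the snippet is written in T-SQL although the same pattern applies on the Oracle side:

```sql
-- Read the watermark saved by the previous run, then pull only rows changed since then.
DECLARE @LastLoaded DATETIME;

SELECT @LastLoaded = LastLoadedDate
FROM dbo.EtlWatermark
WHERE TableName = 'SourceTable';

SELECT *
FROM dbo.SourceTable
WHERE ModifiedDate > @LastLoaded;   -- incremental extract instead of a full table scan

-- After a successful load, update dbo.EtlWatermark with MAX(ModifiedDate) from this batch.
```

Filtering at the source this way is usually much cheaper than pulling everything and discarding the existing rows with a Lookup.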

Using Sql Server for data mining

I am working on a project where I am storing data in a SQL Server database for data mining. I'm at the first step of data mining: collecting data.
All the data is currently stored in a SQL Server 2008 database, spread across a couple of different tables at the moment. The main table gains about 100,000 rows per day.
At this rate the table will have more than a million records in about a month.
I am also running certain SELECT statements against these tables to get up-to-the-minute, real-time statistics.
My question is how to handle such large data without impacting query performance. I have already added some indexes to help with the SELECT statements.
One idea is to archive the database once it hits a certain number of rows. Is this the best solution going forward?
Can anyone recommend the best way to handle such data, keeping in mind that down the road I want to do some data mining if possible?
Thanks
UPDATE: I have not researched enough to decide which tool I will use for data mining. My first task is to collect the relevant information, and then do the data mining.
My question is how to manage the growing table so that running SELECTs against it does not cause performance issues.
What tool will you be using to do the data mining? If you use a tool that reads from a relational source, then you check the workload it submits to the database and optimise based on that. So you don't know which indexes you'll need until you actually start doing data mining.
If you are using the SQL Server data mining tools, they pretty much run off SQL Server cubes (which pre-aggregate the data). In that case you want to consider which data structure will let you build cubes quickly and easily.
That data structure would be a star schema. But there is additional work required to get the data into a star schema, and in most cases you can build a cube off a normalised/OLTP structure well enough.
So, assuming you are using the SQL Server data mining tools, your next step is to build a cube from the tables you have right now and see what challenges you run into.
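As an illustration of what a star schema could look like here, a minimal T-SQL sketch; the question doesn't say what the rows describe, so every table and column name below is hypothetical:

```sql
-- One narrow fact table keyed to small dimension tables: the shape cubes build from easily.
CREATE TABLE dbo.DimDate (
    DateKey   INT      NOT NULL PRIMARY KEY,  -- e.g. 20240131
    FullDate  DATE     NOT NULL,
    [Month]   TINYINT  NOT NULL,
    [Year]    SMALLINT NOT NULL
);

CREATE TABLE dbo.DimSource (
    SourceKey  INT IDENTITY(1, 1) PRIMARY KEY,
    SourceName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.FactMeasurement (
    DateKey   INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    SourceKey INT NOT NULL REFERENCES dbo.DimSource (SourceKey),
    Value     DECIMAL(18, 4) NOT NULL
);
```

The existing tables would be loaded into this shape by the ETL, and the cube (or the up-to-the-minute statistics queries) would then aggregate the fact table by the dimension keys.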

Best way for incremental load in ssis

I am getting 600,000 rows daily from my source and I need to load them into the SQL Server destination as an incremental load.
Now, as the destination table is likely to grow day by day, which would be the best approach for the incremental load? I have a few options in mind:
Lookup Task
Merge Join
SCD
etc.
Please suggest the best option that will perform well for an incremental load.
Look at Andy Leonard's excellent Stairway to Integration Services series, or Todd McDermid's videos on how to use the free SSIS Dimension Merge SCD component. Both will address how to do it right far better than I could describe in this box.
Merge Join is a huge performance problem, as it requires sorting all records up front, and it should not be used for this.
We process many multi-million-record files daily. We generally place them in a staging table and do a hash compare against the data in our change-data-tracking tables to see whether the data differs from what is in prod, and then load only the rows that are new or have changed. Because we do the comparison outside of our production database, we have very little impact on prod: instead of checking millions of records against prod, we only deal with the 247 rows it actually needs. In fact, for our busiest server, all of this processing happens on a separate server except for the last step that writes to prod.
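A minimal T-SQL sketch of that hash-compare idea, assuming SQL Server 2012 or later; the tables dbo.StagingTable and dbo.ChangeTracking, the BusinessKey/RowHash columns, and the Col1..Col3 list are all hypothetical stand-ins:

```sql
-- Hash each staged row and keep only the keys that are new or whose hash differs
-- from the hash recorded for the same key, so prod only ever sees the delta.
SELECT s.BusinessKey,
       HASHBYTES('SHA2_256', CONCAT(s.Col1, '|', s.Col2, '|', s.Col3)) AS NewHash
INTO   #Delta
FROM   dbo.StagingTable AS s
LEFT JOIN dbo.ChangeTracking AS c
       ON c.BusinessKey = s.BusinessKey
WHERE  c.BusinessKey IS NULL                                                          -- new row
   OR  c.RowHash <> HASHBYTES('SHA2_256', CONCAT(s.Col1, '|', s.Col2, '|', s.Col3));  -- changed row
```

Only the rows in #Delta would then be sent to production (and the stored hashes updated), which is how millions of incoming records shrink to the handful that actually need to move.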
If you only need to insert them, it doesn't actually matter much which you pick.
If you need logic like "if exists, update, else insert", I suggest creating an OLE DB Source where you query your 600,000 rows and check whether they already exist with a Lookup task against the existing data source. Since the existing data source is (or tends to be) huge, be careful with how you configure the caching mode. I would go with a partial cache with a memory limit, ordered by the ID you are looking up (this detail is very important because of the way the caching works).
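For comparison, the same "if exists, update, else insert" logic can also be done set-based in T-SQL once the incoming rows have been landed in a staging table. This is an alternative to the Lookup approach above, not what that answer describes, and all names below are hypothetical:

```sql
-- Upsert the staged batch into the target in one statement.
MERGE dbo.TargetTable AS t
USING dbo.StagingRows AS s
    ON t.BusinessKey = s.BusinessKey
WHEN MATCHED AND (t.Col1 <> s.Col1 OR t.Col2 <> s.Col2) THEN
    UPDATE SET t.Col1 = s.Col1,
               t.Col2 = s.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);
```

Whether this beats a well-configured Lookup depends on batch size and indexing, so treat it as an option to test rather than a recommendation.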

mysql optimization script file

I'm looking at having someone do some optimization on a database. If I gave them a similar version of the DB with different data, could they create a script file to run all the optimizations on my database (i.e. create indexes, etc.) without ever seeing or touching the actual database? I'm looking at MySQL but would be open to other DBs if necessary. Thanks for any suggestions.
EDIT:
What if it were an identical copy with transformed data, along with a couple of sample queries that approximate what the DB is used for (i.e. OLAP vs. OLTP)? Would a script be able to contain everything, or would they need hands-on access to the actual DB?
EDIT 2:
Could I create a copy of the DB, transform the data to make it unrecognizable, create a backup file of the DB, give it to the vendor, and have them give me a script file to run on my DB?
Why are you concerned that they should not access the database? You will get better optimization if they have the actual data, as they can consider table sizes, which queries run the slowest, whether to denormalise where necessary, putting small tables completely in memory, and so on.
If it is an issue of confidentiality, you can always anonymize the data by replacing names.
If it's just adding indices, then yes. However, there are a number of things to consider when "optimizing". Which are the slowest queries in your database? How large are certain tables? How can things be changed or migrated to make those queries run faster? This can be harder to see with sparse sample data. You might also include a query log so that this person can see how you're using the tables, what you're trying to get out of them, and how long those operations take.
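For illustration, the kind of MySQL statements such a delivered script (and the query log it is based on) might involve; the orders table and customer_id column are hypothetical examples:

```sql
-- Capture the real workload so the slowest queries can be identified.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;   -- log statements slower than 1 second

-- Check how MySQL currently executes one of the sample queries.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- A typical optimization the script could then contain: an index on the filtered column.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```

Statements like these travel well as a script, which is why straightforward index work can be handed over this way, while decisions such as denormalising or keeping small tables in memory usually still need the real data and workload.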