SSIS to insert non-matching data on non-linked server - ssis

This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have hundreds of millions of rows and some of the production tables have tens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLE DB Source to pull the appropriate production data.
Lookup to check whether the production data's key already exists in the history table, using "Redirect to error output" to send the non-matching rows onward.
OLE DB Destination to write the missing rows to the history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.

Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit, and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live, and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data, but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth. Maybe you have some non-clustered indexes on a target table. Disabling or dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data. Takeaway: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s).
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hit a hiccup in processing and design it to be "restartable." I would have a package whose data flow selects only the business keys from the OLTP system that are used to make a match against the target table. Write the keys that have no match into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows) joined back to the original table as the Source. It's wired directly to the Destination, so no lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the Archive server from that key table. This batching method can also allow you to run the second package on multiple servers. The SSIS team describes it better in their paper "We loaded 1TB in 30 minutes".
Ensure the Lookup uses a query of the form SELECT key1, key2 FROM MyTable so that only the key columns are cached. Better yet, can you provide a filter to the lookup, such as WHERE ProcessingYear = 2013? There's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
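The key-staging and batching idea above can be sketched roughly as follows. This is plain Python rather than SSIS, just to show the shape of the logic: the names `missing_keys` and `batches` and the batch size are hypothetical, and in a real package the key comparison would be done by the Lookup (or a set-based query), not by Python sets.

```python
from itertools import islice
from typing import Hashable, Iterable, Iterator, List


def missing_keys(source_keys: Iterable[Hashable],
                 archive_keys: Iterable[Hashable]) -> List[Hashable]:
    """Return the source keys that are absent from the archive, in source order."""
    archived = set(archive_keys)  # only narrow keys are held in memory, never full rows
    return [k for k in source_keys if k not in archived]


def batches(keys: Iterable[Hashable], size: int) -> Iterator[List[Hashable]]:
    """Yield successive batches of at most `size` keys."""
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Stand-in for the contents of the ToBeTransfered work table.
    to_be_transferred = missing_keys(range(1, 11), [2, 4, 6, 8])
    for batch in batches(to_be_transferred, 3):
        # The second package would join these keys back to the source table,
        # send the full rows, then delete this batch from the work table.
        print(batch)
```

Because each batch is removed from the work table only after it has been sent, a failed run can pick up from whatever keys are still left.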

Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use a MERGE statement to identify new/existing records and operate on them.
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.
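A minimal sketch of the stage-then-merge pattern from the two steps above, with loud caveats: it uses SQLite from Python as a stand-in for SQL Server, the table names `staging` and `history` and their columns are made up for illustration, and because SQLite has no MERGE statement it only shows the insert-new-rows half using INSERT ... WHERE NOT EXISTS.

```python
import sqlite3


def load_new_rows(conn: sqlite3.Connection, rows) -> int:
    """Bulk-load rows into a staging table, then insert only unseen keys."""
    cur = conn.cursor()
    cur.execute("DELETE FROM staging")  # start each run with an empty staging table
    cur.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)", rows)
    # Set-based anti-join: everything happens on the destination server.
    cur.execute(
        """
        INSERT INTO history (id, payload)
        SELECT s.id, s.payload
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM history AS h WHERE h.id = s.id)
        """
    )
    inserted = cur.rowcount
    conn.commit()
    return inserted


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO history VALUES (?, ?)", [(1, "a"), (2, "b")])
    inserted = load_new_rows(conn, [(2, "b"), (3, "c"), (4, "d")])
    ids = [r[0] for r in conn.execute("SELECT id FROM history ORDER BY id")]
    return inserted, ids
```

The trade-off is the one already mentioned: the staging copy costs disk space on the destination, but the comparison becomes a single set-based statement instead of a row-by-row lookup across the network.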

Related

How can I put regularly accessed data into a "quick access" area in a database

Very soon I will be building a database structure that will contain 2 million rows. Generally there are no more than 200 row queries per minute, and those 200 queries hit only 10-20 distinct rows.
Given the size of the table, I'd like to "store" the queried row somewhere so that any other end users querying this row will be able to get the row data "quicker". I then want this row to be accessed via this for a while and then put back into the main table once it's no longer in use. I believe this will make access quicker and more efficient.
Using the below schema, I'll provide an example. In this case row 1 has been accessed from the application layer. The application layer queries the "accessed" table to see if the row is there. If it is, it uses this and updates the "accessed" table with any changed data. If it isn't, it is queried from the main large table and dropped into the "accessed" table until the cron runs (say 10 minutes later) when all "accessed" data is copied into the main table and deleted from the accessed table.
http://sqlfiddle.com/#!2/d76f6/2
I'm trying to work out the following:
1) Will this show an increase in efficiency (I would imagine each query against "accessed" instead of the main will be significantly faster)?
2) What technology should be used for the "accessed" data storage? It's likely the main table will be stored in MariaDB/MySQL, however I'm happy to run it in flat files, sqlite, a different instance or keep it within the same instance... I'm open to suggestions that will make this more efficient, and in theory there's no reason the application layer couldn't act as an intermediary between any technologies
Premature optimization. Overcomplex design to start with. What you want to implement is a most-frequently-accessed cache system. However, the duty of a DBMS is indeed to do this kind of system optimization for you. There are already caches at disk level, file system level, and database level. What you are saying is that, even before having the system in place, you already know it is not going to perform as expected.
Maybe you know more than you state in your question, but on the face of it, optimizations should be done after, with suitable profiling.
There are a lot of ways to cache data.
On MySQL you can use MEMORY tables. MEMORY tables are much faster than InnoDB or MyISAM tables.
You can use memory-based key-value stores such as Redis or memcached.
At the application layer, you can cache your data on the filesystem.
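If profiling later shows that a cache really is needed, an application-layer version can be very small. The following is only a sketch: the class name `TTLCache`, the `loader` callback and the expiry value are all hypothetical, it is a read-through cache only (it does not handle writes back to the main table), and it keeps everything in process memory rather than in Redis or memcached.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process read-through cache with per-entry expiry."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader        # function that fetches a row from the main table
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh: no database round trip
        value = self._loader(key)    # miss or expired: go to the main table
        self._entries[key] = (now + self._ttl, value)
        return value
```

Usage would look like `cache = TTLCache(load_row_from_db, ttl_seconds=600)` followed by `cache.get(row_id)`, where `load_row_from_db` is whatever function the application already uses to query the main table.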

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
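The fingerprint-cache idea above might look something like this. It is a sketch under assumptions: the function name `changed_rows` and the `key_of` callback are invented for illustration, the cache is a plain dict (a GDBM file or memcached would play the same role), and rows that disappear from the file or carry the deleted-record marker would still need separate handling.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple


def changed_rows(lines: Iterable[str],
                 key_of: Callable[[str], str],
                 cache: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (key, line) pairs whose SHA-1 differs from the cached fingerprint.

    `cache` is updated in place, so the next run only reports new changes.
    """
    changed = []
    for line in lines:
        key = key_of(line)
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if cache.get(key) != digest:
            changed.append((key, line))  # caller would INSERT/UPDATE this row
            cache[key] = digest
    return changed


if __name__ == "__main__":
    cache: Dict[str, str] = {}
    record_id = lambda line: line[:4]  # assumes the first 4 characters are the key
    print(changed_rows(["0001ALICE", "0002BOB  "], record_id, cache))
    print(changed_rows(["0001ALICE", "0002BOBBY"], record_id, cache))
```

On the second call only the modified row comes back, so the unchanged rows are never touched in MySQL.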
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate / transfer it over.
I agree with Alex's tips. If you can, update only modified fields, and do mass updates with transactions and multiple inserts grouped together. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table, then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips such as:
- delayed inserts in MySQL may improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to md5 the rows
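The "insert into a new table, then rename it" tip could be scripted roughly like this. The helper name `swap_load_statements` and the `_new` / `_old` suffixes are hypothetical, and the function only builds the statements; it assumes the table name and file path come from trusted configuration, since nothing is escaped.

```python
from typing import List


def swap_load_statements(table: str, data_file: str) -> List[str]:
    """Build MySQL statements that load into a shadow table and swap it in."""
    new, old = f"{table}_new", f"{table}_old"
    return [
        f"DROP TABLE IF EXISTS {new}",
        f"CREATE TABLE {new} LIKE {table}",
        f"LOAD DATA INFILE '{data_file}' INTO TABLE {new}",
        # RENAME TABLE with several pairs is performed as one atomic operation,
        # so readers see either the old table or the new one, never an empty one.
        f"RENAME TABLE {table} TO {old}, {new} TO {table}",
        f"DROP TABLE {old}",
    ]


if __name__ == "__main__":
    for statement in swap_load_statements("legacy_sales", "/tmp/filesale.data"):
        print(statement + ";")
```

This sidesteps the window in which the live table is empty. It is also worth knowing that in MySQL a TRUNCATE TABLE causes an implicit commit, so it cannot be rolled back as part of a surrounding transaction.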

MySql, LOAD DATA or BATCH INSERT or any other better way for bulk inserts

I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: a single request contains 10,000 to 1,00,000 data sets of information
(each data set needs to be inserted separately as a row in the database).
I may get multiple requests to this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me, LOAD DATA or BATCH INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background, thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it into the database straight away, but yes, if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Those are the answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
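To give a feel for the in-memory queue approach, here is a minimal sketch. It is written in Python rather than Java, with `queue.Queue` standing in for the Java queue class mentioned above; the function name `run_pipeline`, the worker count and the queue bound are all assumptions made for the example.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_pipeline(batches: Iterable[List[int]],
                 handle: Callable[[int], int],
                 workers: int = 4) -> List[int]:
    """Feed records from incoming batches through a shared in-memory queue."""
    work: "queue.Queue" = queue.Queue(maxsize=10_000)  # bounded: producers block when full
    results: List[int] = []
    lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def worker() -> None:
        while True:
            item = work.get()
            if item is stop:
                return
            out = handle(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batches:          # each "request" is one batch of records
        for record in batch:
            work.put(record)
    for _ in threads:              # one sentinel per worker
        work.put(stop)
    for t in threads:
        t.join()
    return results                 # completion order, not arrival order
```

Note that with several workers the records finish in no particular order, which is exactly why the questions above about ordering and independence matter. Unlike a MySQL table, this queue also loses its contents if the process dies.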

Best way for incremental load in ssis

I am getting 600,000 rows daily from my source and I need to dump them into the SQL Server destination, which would be an incremental load.
Now, as the destination table size is likely to increase day by day, which would be the best approach for the incremental load? I have a few options in mind:
Lookup Task
Merge Join
SCD
etc..
Please suggest the best option, one that will perform well for an incremental load.
Look at Andy Leonard's excellent Stairway to Integration Services series or Todd McDermid's videos on how to use the free SSIS Dimension Merge SCD component. Both will address how to do it right far better than I could enumerate in this box.
Merge join is a huge performance problem as it requires sorting of all records upfront and should not be used for this.
We process many multimillion-record files daily and generally place them in a staging table and do a hash compare to the data in our Change data tracking tables to see if the data is different from what is on prod, and then only load the new ones or ones that are different. Because we do the comparison outside of our production database, we have very little impact on prod because instead of checking millions of records against prod, we are only dealing with the 247 that it actually needs to have. In fact, for our busiest server, all this processing happens on a separate server except for the last step that goes to prod.
If you only need to insert them, it doesn't actually matter.
If you need something like "if exists, update, else insert", I suggest creating an OLE DB Source where you query your 600,000 rows and check whether they exist with a Lookup against the existing data source. Since the existing data source is (or tends to be) HUGE, be careful with the way you configure the caching mode. I would go with partial cache with some memory limit, ordered by the ID you are looking up (this detail is very important because of the way the caching works).

MS Access databases on slow network: Is it faster to separate back ends?

I have an Access database containing information about people (employee profiles and related information). The front end has a single console-like interface that modifies one type of data at a time (such as academic degrees from one form, contact information from another). It is currently linked to multiple back ends (one for each type of data, and one for the basic profile information). All files are located on a network share and many of the back ends are encrypted.
The reason I have done that is that I understand that MS Access has to pull the entire database file to the local computer in order to make any queries or updates, then put any changed data back on the network share. My theory is that if a person is changing a telephone number or address (contact information), they would only have to pull/modify/replace the contact information database, rather than pull a single large database containing contact information, projects, degrees, awards, etc. just to change one telephone number, thus reducing the potential for locked databases and network traffic when multiple users are accessing data.
Is this a sane conclusion? Do I misunderstand a great deal? Am I missing something else?
I realize there is the consideration of overhead with each file, but I don't know how great the impact is. If I were to consolidate the back ends, there is also the potential benefit of being able to let Access handle referential integrity for cascading deletes, etc., rather than coding for that...
I'd appreciate any thoughts or (reasonably valid) criticisms.
This is a common misunderstanding:
"MS Access has to pull the entire database file to the local computer in order to make any queries or updates"
Consider this query:
SELECT first_name, last_name
FROM Employees
WHERE EmpID = 27;
If EmpID is indexed, the database engine will read just enough of the index to find which table rows match, then read the matching rows. If the index includes a unique constraint (say EmpID is the primary key), the reading will be faster. The database engine doesn't read the entire table, nor even the entire index.
Without an index on EmpID, the engine would do a full table scan of the Employees table --- meaning it would have to read every row from the table to determine which include matching EmpID values.
But either way, the engine doesn't need to read the entire database ... Clients, Inventory, Sales, etc. tables ... it has no reason to read all that data.
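The index-versus-scan difference can be observed directly in engines that expose their query plans. The sketch below uses SQLite from Python, not the Access database engine, so it only illustrates the general principle; the helper name `plan_kind` and the index name `idx_emp` are made up for the example.

```python
import sqlite3


def plan_kind(conn: sqlite3.Connection, sql: str) -> str:
    """Return the first word of SQLite's plan: 'SEARCH' (index) or 'SCAN' (full table)."""
    row = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()
    return row[-1].split()[0]  # the last column holds the human-readable detail


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employees (EmpID INTEGER, first_name TEXT, last_name TEXT)"
    )
    query = "SELECT first_name, last_name FROM Employees WHERE EmpID = 27"
    before = plan_kind(conn, query)   # no index yet: every row must be read
    conn.execute("CREATE INDEX idx_emp ON Employees (EmpID)")
    after = plan_kind(conn, query)    # index available: only matching rows are read
    return before, after


if __name__ == "__main__":
    print(demo())
```

In neither case does the plan involve any other table in the database file.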
You're correct that there is overhead for connections to the back-end database files. The engine must manage a lock file for each database. I don't know the magnitude of that impact. If it were me, I would create a new back-end database and import the tables from the others. Then make a copy of the front-end and re-link to the back-end tables. That would give you the opportunity to examine the performance impact directly.
Seems to me relational integrity should be a strong argument for consolidating the tables into a single back-end.
Regarding locking, you shouldn't ever need to lock the entire back-end database for routine DML (INSERT, UPDATE, DELETE) operations. The database engine supports more granular locking. It also supports pessimistic vs. optimistic locking, that is, whether the lock occurs once you begin editing a row or is deferred until you save the changed row.
Actually "slow network" could be the biggest concern if slow means a wireless network. Access is only safe on a hard-wired LAN.
Edit: Access is not appropriate for a WAN network environment. See this page by Albert D. Kallal.
MS Access is not well suited to a local area network, nor to a wide area network, which is certainly slower. The solution is to use a client-server database such as MS SQL Server or MySQL. MS SQL Server is much better than MySQL, but it is not free. Consider MS SQL Server for large-scale projects. Again, MS Access is only good for a single computer, not for a computer network.