I have a Tabular cube which takes a long time to process. My idea is to process only new data every hour and do a full process during the night. Is there a way to do that with SSIS and a SQL Agent job?
Assuming your "new rows" are inserts to your fact table rather than updates or deletes, you can do a ProcessAdd operation. ProcessAdd takes a SQL query you provide that returns the new rows and adds the results to your table in SSAS Tabular.
There are several ways to automate this, all of which could be run from SSIS. This article walks through the options well.
If you have updates and deletes then you need to partition your table inside SSAS. For example, partition by week, then only reprocess (ProcessData) the partitions where rows have been inserted, updated, or deleted.
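As a rough illustration, the query you bind to ProcessAdd is just a filter that returns only the rows added since the last hourly run. The table, column, and watermark names below are placeholders, assuming your fact table carries a load timestamp:

```sql
-- Minimal sketch: select only the rows loaded since the last hourly run.
-- dbo.FactSales, LoadDateTime and dbo.ProcessAddWatermark are hypothetical;
-- substitute your own fact table and whatever audit column you have.
SELECT f.SaleKey,
       f.CustomerKey,
       f.SaleDate,
       f.Amount
FROM   dbo.FactSales AS f
WHERE  f.LoadDateTime > (SELECT MAX(LastLoadDateTime)
                         FROM   dbo.ProcessAddWatermark);
```

The hourly SQL Agent job runs the ProcessAdd with a query like that, and the nightly job does the full process, which rebuilds everything from scratch.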
Related
I have an ETL process which takes data from a transaction DB and, after processing, stores it in another DB. While storing the data we truncate the old data and insert the new data to get better performance, as an update takes much longer than a truncate-and-insert. Because of this, for a short time (2-3 minutes) queries see counts of 0 or wrong data. We run the ETL every 8 hours.
So how can we avoid this problem? How can we achieve zero downtime?
One approach we used in the past was to prepare the production data in a table named temp. Then, when it was finished (and checked, which was the lengthy part of our process), drop prod and rename temp to prod.
The swap takes almost no time, and the process succeeded even when other users had locks on the table.
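A minimal T-SQL sketch of that swap, assuming SQL Server and using placeholder table names (other engines have their own rename syntax):

```sql
-- dbo.Report_Temp has already been fully loaded and verified at this point.
-- The swap itself is a metadata operation, so readers are blocked only briefly.
BEGIN TRAN;
    DROP TABLE dbo.Report_Prod;
    EXEC sp_rename 'dbo.Report_Temp', 'Report_Prod';
COMMIT;
```

One thing to watch: permissions and constraints defined on the old table go with it when it is dropped, so they need to be re-created on the new one as part of the same job.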
During a month, a process inserts a large number of rows (~1M) into some database tables.
This happens daily and the whole process lasts ~40 minutes. That is fine.
I created some "summary tables" from these inserts so that I can query the data quickly. This works fine.
Problem: I keep inserting data into the summary tables, so the time to build them keeps pace with the process that inserts the actual data, which is good. But if data inserted on previous days has changed (due to updates), then I need to "recalculate" those previous days. To handle this, instead of creating just today's summary data each day, I would have to change my process to recreate the summary data from the beginning of each month, which would increase my running time substantially.
Is there a standard way to deal with this problem?
We had a similar problem in our system, which we solved by generating a summary table holding each day's summary.
Whenever an UPDATE/INSERT changes the base tables, the summary table is updated. This of course slows down those operations, but it keeps the summary table completely up to date.
This can be done using TRIGGERs, but as the operations are in one place, we just do it manually in a TRANSACTION.
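A rough sketch of what that transaction can look like, written here as T-SQL with made-up table and column names (the pattern is the same in any engine):

```sql
-- Hypothetical detail and daily-summary tables: the summary row for the day
-- is maintained inside the same transaction as the detail insert.
DECLARE @SaleDate datetime      = GETDATE(),
        @Amount   decimal(18,2) = 100.00;

BEGIN TRAN;

    INSERT INTO dbo.SalesDetail (SaleDate, Amount)
    VALUES (@SaleDate, @Amount);

    UPDATE dbo.SalesDailySummary
       SET TotalAmount = TotalAmount + @Amount
     WHERE SummaryDate = CAST(@SaleDate AS date);

    IF @@ROWCOUNT = 0  -- no summary row for this day yet
        INSERT INTO dbo.SalesDailySummary (SummaryDate, TotalAmount)
        VALUES (CAST(@SaleDate AS date), @Amount);

COMMIT;
```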
One advantage of this approach is that there is no need to run a cron job to refresh/create the summary table.
I understand that this may not be applicable/feasible for your situation.
Hi, I have around 1,000 million records in a production table and around 1,500 million records in a staging table. I need to compare the staging data with the production data: if there is a new record, insert it, and if a record has changed, update the column or insert a new one. What is the best approach for this situation? Please suggest.
If you want to get a copy of a table from one database to another four times a year, one way is to simply truncate and reload the target table. Don't mess around trying to update individual rows; it's often quicker to delete the lot and reload them.
You need to do some analysis and see if there is any field you can use in the source to reliably load a subset of the data. For example, if there is a 'created date' on the record you can use that to load just the data created in the last three months (instead of all of it), along the lines of the sketch below.
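A minimal sketch of that kind of subset load, with invented database, table, and column names:

```sql
-- Copy only the rows created in roughly the last three months instead of
-- reloading the whole table; all object names here are placeholders.
INSERT INTO TargetDb.dbo.BigTable (Id, SomeColumn, CreatedDate)
SELECT s.Id, s.SomeColumn, s.CreatedDate
FROM   SourceDb.dbo.BigTable AS s
WHERE  s.CreatedDate >= DATEADD(MONTH, -3, GETDATE());
```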
If you add some more info then we can be more specific about a solution.
e.g.
-This has to be fast but we have lots of disk space
-This must be simple to maintain and can take a day to load
Also... I assume the source and target are SQL Server? If so, is it the Enterprise edition?
I have a question. I'm working on loading a very large table with existing data in the order of 150 million records, which keeps growing by 1 million records added daily. A few days back the ETL started failing even after running for 24 hrs. In the DFT, we have a source query pulling 1 million records which is LOOKed UP against the destination table of 150 million records to check for new records. It is failing because the LOOKUP cannot hold data for 150 million records. I have tried changing the LOOKUP to a Merge Join without success. Can you please suggest alternative designs to load the data into the large table successfully? Moreover, there is no way I can reduce the size of the destination table. I already have indexes on all required columns. I hope I'm clear in explaining the scenario.
You can try table partitioning to import a large amount of data. Take a look here for an example, and at this other link that might be useful. Also, check MSDN for more information about creating and maintaining partitioned tables.
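If you go down the partitioning route, the usual pattern is to bulk-load a staging table that matches the target's structure and then switch it into an empty partition, which is a metadata-only operation. A very rough sketch with invented names (the staging table must share the target's schema, indexes, and filegroup, and carry a check constraint matching the partition boundary):

```sql
-- dbo.FactLarge_Stage holds the day's 1M rows and mirrors dbo.FactLarge.
-- Switching it into the empty partition for the load window moves no data,
-- only metadata; 42 stands in for whatever partition number covers the day.
ALTER TABLE dbo.FactLarge_Stage
    SWITCH TO dbo.FactLarge PARTITION 42;
```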
Separate the journey into two steps:
- Create a new data flow task (or separate package) to stage your source table "as is" in an environment close to your target database.
- Alter your existing data flow task to query across both your staged table and the target table to pull out only new (and changed?) records, and process these accordingly within SSIS (see the sketch below). This should eliminate the need for the expensive lookup that does a round trip to the database per record.
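A sketch of what that step-two query could look like, with placeholder table and key names; once the source is staged next to the target, the new rows fall out of one set-based anti-join instead of a per-row lookup:

```sql
-- Insert only the staged rows whose business key does not already exist in
-- the 150M-row destination table; all object names are illustrative.
INSERT INTO dbo.DestinationFact (BusinessKey, Col1, Col2)
SELECT s.BusinessKey, s.Col1, s.Col2
FROM   dbo.SourceStage AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DestinationFact AS d
                   WHERE  d.BusinessKey = s.BusinessKey);
```

Changed rows can be handled the same way with an UPDATE joined across the two tables, or both cases together with a MERGE.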
I have a table with 4mil+ records. There is a staging table that gets updated with data via an ETL process throughout the day. Once the staging table gets updated, I then need to sync that data with the production table. I'm currently using an INSERT/ON DUPLICATE KEY UPDATE query to sync them; however, with the size of this table it takes ~750 seconds to run. Is there a more efficient way to update/insert the new data? I have read a bit about partitioning tables, but I'm not sure if that's what I need to do or not. Can anyone give me some suggestions on how to accomplish this more efficiently?
I would use the maatkit tools (http://www.maatkit.org/), specifically http://www.maatkit.org/doc/mk-table-sync.html. It is pretty efficient at this sort of thing.