Convert SQL SELECT statement to SSIS Data Flow Task

In the context of a traditional ETL, I wrote several views (SQL SELECTs with some JOINs) that I would like to "translate" into Data Flow Tasks in SSIS.
Is there any tool to automate this?
Thanks in advance

Changing SQL views into Data Flows in SSIS is generally a very bad idea. The SQL engine is very good at fetching your data in the optimal way (in most cases) so you get the data as fast as possible; it uses index statistics, cached data and so on.
Obtaining the same result in a Data Flow requires fetching the data from the different tables and then doing the filtering and joining in the pipeline. Even though SSIS is fast, it will always be faster to solve the query directly on the database (as long as it involves just one server), since database engines were built specifically for that purpose. It will also be a lot harder to design and maintain (the SSIS graphical interface vs. SQL).
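As an illustration (the view and table names here are hypothetical), a view like the following can be used as the query of a single OLE DB Source, instead of being rebuilt in the Data Flow with several sources, Sorts and Merge Joins:

    -- Hypothetical view: let the database engine resolve the joins and filter,
    -- then point one OLE DB Source at it (SELECT * FROM dbo.vw_OrderDetails).
    CREATE VIEW dbo.vw_OrderDetails AS
    SELECT o.OrderID,
           o.OrderDate,
           c.CustomerName,
           d.ProductID,
           d.Quantity * d.UnitPrice AS LineTotal
    FROM   dbo.Orders       AS o
    JOIN   dbo.Customers    AS c ON c.CustomerID = o.CustomerID
    JOIN   dbo.OrderDetails AS d ON d.OrderID    = o.OrderID
    WHERE  o.OrderDate >= '2015-01-01';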
You should use SSIS for its intended purpose, integrating data, and leave the heavy processing to your databases.
As for your question about automation, I don't think there is a tool to translate SQL views into Data Flow components; you will have to design them manually.

Related

Can we do all the things which we can do in BizTalk using SSIS

I have been using SSIS for a while, and I have never come across BizTalk.
One of the data migration projects we are doing also involves BizTalk, apart from SSIS.
I just wondered what the need for BizTalk is if we already have SSIS as an ETL tool.
SSIS is well suited for bulk ETL batch operations where you're transferring data between a SQL Server and
Another RDBMS
Excel
A simple CSV file
You do not need row by row processing
Your mapping is primarily data type conversion mapping (e.g. changing VARCHAR to NVARCHAR or DATETIME to VARCHAR); see the sketch after this list
You're ok with error/fault handling for batches rather than rows
You're doing primarily point to point integrations that are unlikely to change or will only be needed temporarily.
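For example (the columns are hypothetical), that kind of mapping is little more than a cast, whether done in a Data Conversion transformation or pushed into the source query:

    -- Hypothetical source query: the "transformation" is just a type cast.
    SELECT CAST(CustomerName AS NVARCHAR(100))  AS CustomerName,
           CONVERT(VARCHAR(23), OrderDate, 121) AS OrderDateText
    FROM   dbo.Orders;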
BizTalk is well suited for real time messaging needs where:
You're transferring messages between any two end points
You need a centralized hub and/or ESB for message processing
You need fine grained transformations of messages
You need to work with more complicated looping file structures (i.e. not straight up CSV)
You need to apply analyst manageable business rules
You need to be able to easily swap out endpoints at run time
You need more enhanced error/fault management for individual messages/rows
You need enhanced B2B capabilities (EDI, HL7, SWIFT, trading partner management, acknowledgements)
Each can do the other's job with a lot of extra work, but to see this, try to get SSIS to do a task that requires calling a stored procedure per row with proper error handling and transformation of each row, and try to have BizTalk do a bulk ETL operation that requires minimal transformation. Both can do either, but it will be painful.
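To make the first half of that concrete, here is a rough T-SQL sketch (procedure and table names are hypothetical) of the per-row pattern that set-based tools handle so awkwardly: one procedure call and one error handler per row:

    DECLARE @Id INT;
    DECLARE row_cursor CURSOR FOR SELECT MessageId FROM dbo.InboundMessages;
    OPEN row_cursor;
    FETCH NEXT FROM row_cursor INTO @Id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- Hypothetical per-row transformation/processing procedure.
            EXEC dbo.usp_ProcessMessage @MessageId = @Id;
        END TRY
        BEGIN CATCH
            -- Per-row fault handling: log the failure and carry on.
            INSERT INTO dbo.MessageErrors (MessageId, ErrorText)
            VALUES (@Id, ERROR_MESSAGE());
        END CATCH;
        FETCH NEXT FROM row_cursor INTO @Id;
    END;
    CLOSE row_cursor;
    DEALLOCATE row_cursor;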
The short answer: no.
BizTalk Server and SSIS are different paradigms and are used to complement each other, not in opposition. They are both part of the BizTalk Stack and are frequently used in the same app.
BizTalk is a messaging platform, and a BizTalk app will tend to process one entity at a time. SSIS is set based and works best for bulk, table-based operations.

Using JPA (Hibernate) vs. stored procedures

I am working on a project using the ZK Framework, Hibernate, Spring and MySQL.
I need to generate some charts from the MySQL database, but after counting the objects needed to calculate the values for those charts, I found more than 1,400 objects, with the same number of queries and transactions.
So I thought of using stored procedures in MySQL to calculate those values and save them in separate tables (an architecture close to a data warehouse), and then having my web application just read the values from those tables and display them as charts.
I want to know in terms of speed and performance, which of those methods is better?
And thank you
No way to tell, really, without many more details. However:
What you want to do is called Denormalisation. This is a recognised technique for speeding up reporting and making it easier. (If it doesn't, your denormalisation has failed!) When it works it has the following advantages:
Reports run faster
Report code is easier to write
On the other hand:
Report data is out of date, containing only data as at the time you last did the calculations
An extreme form of doing this is to take the OLTP database (a standard database) and export it into an Analysis database (aka a Cube or an OLAP database).
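Short of a full cube, a minimal MySQL sketch of the summary-table approach the question describes might look like this (table, column and procedure names are hypothetical); the web application then only ever reads the summary table:

    -- Denormalised summary table holding pre-calculated chart values.
    CREATE TABLE chart_daily_totals (
        metric_date  DATE          NOT NULL,
        metric_name  VARCHAR(50)   NOT NULL,
        metric_value DECIMAL(18,2) NOT NULL,
        PRIMARY KEY (metric_date, metric_name)
    );

    DELIMITER //
    CREATE PROCEDURE refresh_chart_daily_totals()
    BEGIN
        -- Recalculate everything in one set-based pass over the source data.
        DELETE FROM chart_daily_totals;
        INSERT INTO chart_daily_totals (metric_date, metric_name, metric_value)
        SELECT DATE(created_at), 'orders', SUM(amount)
        FROM   orders
        GROUP  BY DATE(created_at);
    END //
    DELIMITER ;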
The problems with Denormalisation are that a) it is usually a significant effort, b) it adds extra code, which adds complexity and thus increases support costs, and c) it might not make enough (or any) difference. Because of this, it is usual not to do it until you know you have a problem. That happens when you have written your reports against the basic database and found that they are either too difficult to write and/or run too slowly. I would strongly suggest that you only go for Denormalisation when you reach that point.
There can be times when you don't need to do that, but I've only seen one such example in over 25 years of development; and that decision was helped by Management's desire to use an OLAP database for political purposes.

SSIS and ETL speed

I have read plenty of articles stating that SSIS and ETL are much faster and more efficient than using VB6 recordsets and VB.NET DataReaders, but I do not fully understand why this is the case.
I created an SSIS package that looped through one million records and created a new table, and did the same in VB, and this confirmed that SSIS is very fast.
I understand that all the processing is done in the data tier, so there are no costly trips from the application server to the database server, but is there an MSDN article that explains the algorithm that makes SSIS a lot quicker?
I have a VB6 app that is very slow and think SSIS is the solution.
The pipeline architecture of the SSIS Data Flow Task is faster due mainly to buffering. By selecting the data in "chunks", the pipeline can perform many operations in RAM, then pass the data buffer downstream for further processing. Depending on the size and shape of the data, and the location and type of the source and destination, you can sometimes achieve better results outside of SSIS.
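A rough SQL analogy (the tables are hypothetical) of the difference between row-at-a-time client code and buffered, set-based processing:

    -- Row-at-a-time, which is effectively what a VB6 recordset or DataReader
    -- loop issues: one statement and one round trip per record.
    --   INSERT INTO dbo.Target (Id, Amount) VALUES (@Id, @Amount);

    -- Set-based: one statement, and the engine (much like an SSIS buffer)
    -- processes large batches of rows in memory at a time.
    INSERT INTO dbo.Target (Id, Amount)
    SELECT Id, Amount
    FROM   dbo.Source;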

DTS/SSIS vs. Informatica Power Center

I'm sure that this is a pretty vague question that is difficult to answer but I would be grateful for any general thoughts on the subject.
Let me give you a quick background.
A decade ago, we used to write data loads that read input flat files from legacy applications and loaded them into our Datamart. Originally, our load programs were written in VB6 and cursored through the flat file; for each record, they performed this general process:
1) Look up the record; if found, update it.
2) Otherwise, insert a new record.
Then we changed this process to use SQL Server DTS to load the flat file into a temp table, and we would then perform a massive set-based join of the temp table against the target production table, taking the data from the temp table and using it to update the target table. Records that didn't join were inserted.
This is a simplification of the process, but essentially the process went from an iterative approach to a "set-based" one, no longer performing updates one record at a time. As a result, we got huge performance gains.
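In SQL terms, the set-based version of that pattern (table names are hypothetical) boils down to an update of the rows that join followed by an insert of the rows that don't:

    -- Update target rows that match the staging/temp table.
    UPDATE t
    SET    t.CustomerName = s.CustomerName,
           t.Amount       = s.Amount
    FROM   dbo.TargetTable AS t
    JOIN   #StagingTable   AS s ON s.BusinessKey = t.BusinessKey;

    -- Insert staging rows that found no match in the target.
    INSERT INTO dbo.TargetTable (BusinessKey, CustomerName, Amount)
    SELECT s.BusinessKey, s.CustomerName, s.Amount
    FROM   #StagingTable AS s
    LEFT   JOIN dbo.TargetTable AS t ON t.BusinessKey = s.BusinessKey
    WHERE  t.BusinessKey IS NULL;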
Then we created what was in my opinion a powerful set of shared functions in a DLL to perform common functions/update patterns using this approach. It greatly abstracted the development and really cut down on the development time.
Then Informatica PowerCenter, an ETL tool, came along, and management wants to standardize on the tool and rewrite the old VB loads that used DTS.
I heard that PowerCenter processes records iteratively, but I know that it does do some optimization tricks, so I am curious how Informatica would perform.
Does anyone have enough experience with DTS or SSIS to make a gut performance prediction as to which would generally perform better?
I joined an organization that used Informatica PowerCenter 8.1.1. Although I can't speak for Informatica setups in general, I can say that at this company Informatica was exceedingly inefficient. The main problem was that Informatica generated some really heinous SQL code on the back-end. When I watched what it was doing with Profiler and reviewed the text logs, I saw that it generated separate insert, update, and delete statements for each row that needed to be inserted/updated/deleted. Instead of trying to fix the Informatica implementation, I simply replaced it with SSIS 2008.
Another problem I had with Informatica was managing parallelization. In both DTS and SSIS, parallelizing tasks was pretty simple -- don't define precedence constraints and your tasks will run in parallel. In Informatica, you define a starting point and then define the branches for running processes in parallel. I couldn't find a way for it to limit the number of parallel processes unless I explicitly defined them by chaining the worklets or tasks.
In my case, SSIS substantially outperformed Informatica. Our load process with Informatica took about 8-12 hours. Our load process with SSIS and SQL Server Agent Jobs was about 1-2 hours. I am certain had we properly tuned Informatica we could have reduced the load to 3-4 hours, but I still don't think it would have done much better.

Large Analytics Database Responsive Retrieval (MYSQL)

I want to create a 'google analytics' type application for the web - i.e. a web-based tool to do some reporting and graphing for my database. The problem is that the database is HUGE, so I can't do the queries in real time because they will take too long and the tool will be unresponsive.
How can I use a cron job to help me? What is the best way to make my graphs responsive? I think I will need to denormalize some of my database tables, but how do I make these queries faster? What intermediate values can I store in another database table to make it quicker?
Thanks!
Business Intelligence (BI) is a pretty mature discipline - and you'll find answers to your questions in any book on scaling databases for reporting & data warehousing.
A high-level list of tactics would include:
partitioning (because indexes are of little help for most reporting)
summary tables (usually generated through a batch process submitted via cron; see the sketch after this list)
a good optimizer (some databases, like MySQL, don't have one, and so make poor join decisions)
query parallelism (some databases will provide linear speedups just by splitting your query into multiple threads)
star-schema - a good data model is crucial to good performance
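As a concrete (and entirely hypothetical) example of the summary-table tactic, a script run nightly from cron can pre-aggregate the raw data so the charts only ever read a small table:

    -- summary_refresh.sql, run nightly, e.g. from crontab:
    --   0 2 * * * mysql analytics < /path/to/summary_refresh.sql
    -- Hypothetical tables: page_views (raw, huge) and daily_page_views (small,
    -- with a primary key on (view_date, page_id) so REPLACE can overwrite).
    REPLACE INTO daily_page_views (view_date, page_id, views)
    SELECT DATE(viewed_at) AS view_date,
           page_id,
           COUNT(*)        AS views
    FROM   page_views
    WHERE  viewed_at >= CURDATE() - INTERVAL 1 DAY
    GROUP  BY DATE(viewed_at), page_id;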
In general, dynamic reporting beats the pants off static reporting, so if you're after powerful reporting I'd just try to copy the data into an appropriate model, use aggregates, and possibly change the database to get a good optimizer and the appropriate features, rather than run reports in batch.