Daily we get data in Excel format. We load it into staging, then go to an SSIS package,
take Excel as the connection manager, perform transformations, and move the data to the warehouse.
Since we are taking the data from Excel anyway, why create a stage and truncate it,
when Excel is the source and every manipulation is done within the package? Can someone please
explain a real-time scenario? I have seen many websites and still couldn't understand what the concept is all about:
staging, source (Excel), lookup, target (warehouse).
Why create a stage when everything is done in the SSIS package anyway?
The staging area is mainly used to extract data from the sources quickly, minimizing the impact on those sources. Once data has been loaded into the staging area, it is used to combine data from multiple sources and to perform transformations, validations, and data cleansing.
You can use one of these staging design patterns:
Incremental load
Truncate Insert
Using Delimiters with HashBytes for Change Detection
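The delimiter-plus-hash idea behind the last pattern can be sketched outside SSIS as well. This is a minimal Python illustration of the concept (not the actual T-SQL `HASHBYTES` call), assuming each row arrives as a list of column values:

```python
import hashlib

def row_hash(values, delimiter="|"):
    """Concatenate column values with a delimiter and hash the result.

    The delimiter prevents ("ab", "c") and ("a", "bc") from producing
    the same concatenation, which is the whole point of the pattern.
    """
    joined = delimiter.join("" if v is None else str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Compare the hash of the staged row with the hash stored in the warehouse;
# a mismatch means the row changed and needs an update.
staged = row_hash(["ACME", "2023-01-01", 100])
warehouse = row_hash(["ACME", "2023-01-01", 150])
print(staged != warehouse)  # True: the row changed
```

In SSIS this comparison is typically done in a derived column plus a lookup against the warehouse's stored hash column, so only changed rows flow to the update path.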
You can also read about package design patterns for loading a data warehouse.
Related
I'm in a traditional Back to Front ETL stack from a data source (Adobe Analytics) to MySQL data warehouse to a Tableau front end for visualization. My question revolves around best practices for cleaning data / mapping and at what step.
1) Cleaning: We have no automated (SSIS, etc.) connector from the source (Adobe) to the data warehouse, so we're left with periodically uploading CSV files. For various reasons these files become less than optimal (misspellings, nulls, etc.). Question: should the 'cleaning' be done on the CSV files, or once the data is uploaded into the MySQL data warehouse (in tables/views)?
2) Mapping: a number of different end-user use cases require us to map the data to tables (geographic regions, types of accounts, etc.). Should this be done in the data warehouse (MySQL joins), or is it just as good in the front end (Tableau)? The real question pertains to performance, I believe, as you could do it relatively easily in either step.
Thanks!
1) Cleaning: I'd advise you to load the data in the CSV files into a staging database and clean it there, before it reaches the database that Tableau connects to. This way you keep the original files, which you can reload later if necessary. I'm not sure what a "traditional Back to Front ETL stack" is, but an ETL tool like Microsoft SSIS or Pentaho Data Integration (free) will be a valuable help with building these processes, and you could then run your ETL jobs periodically or every time a new file is uploaded to the directory. Here is a good example of such a process: https://learn.microsoft.com/en-us/sql/2014/integration-services/lesson-1-create-a-project-and-basic-package-with-ssis
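As a rough illustration of what that staging-level cleaning step can look like (a sketch in Python rather than an ETL tool; the column names are made up):

```python
import csv
import io

def clean_rows(reader):
    """Trim whitespace and turn empty strings into None before loading."""
    for row in reader:
        yield {k: (v.strip() or None) if v is not None else None
               for k, v in row.items()}

# Simulated CSV upload with a stray space and a missing value.
raw = io.StringIO("region,visits\n  East ,100\nWest,\n")
cleaned = list(clean_rows(csv.DictReader(raw)))
print(cleaned)
# [{'region': 'East', 'visits': '100'}, {'region': 'West', 'visits': None}]
```

The same normalizations (trimming, null handling, fixing known misspellings) map directly onto SSIS derived-column transformations once an automated pipeline is in place.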
2) "Mapping": You should have a data model, probably a dimensional model, built on the database that Tableau connects to. This data model should store clean and "business modelled" data. You should perform the lookups (joins/mappings) when you are Transforming your data, so you can Load it into the data model. Having Tableau explore a dimensional model of clean data will also be better for UX/performance.
The overall flow would look something like: CSV -> Staging database -> Clean/Transform/Map -> Business data model (database) -> Tableau
This question may seem very basic, but I'm actually struggling to get it clear in my head.
I have some academic experience with SSIS, SSAS and SSRS.
In simple terms:
SSIS - integration of data from a data source to a data destination;
SSAS - building a cube of data, which allows you to analyze and explore the data;
SSRS - allows you to create dashboards with charts, etc., on top of a data source.
Now, doing a comparison with Qlikview and Qliksense...
Can the Qlik products do exactly the same as SSIS, SSAS, and SSRS? That is, can Qlik products do the extraction (SSIS), data processing (SSAS), and data visualization (SSRS)? Or do they work more on the SSRS side (creating dashboards from data sources)? Do the Qlik tools cover the ETL stages (extract, transform, and load)?
I'm really struggling here, even after reading tons of information about it, so any clarification helps a lot!
Thanks,
Anna
Yes. Qlik (View and Sense) can be used as both an ETL tool and a presentation layer. Each file (qvw for View, qvf for Sense) contains the script used for ETL (load all the required data from all data sources, transform it if needed), the actual data, and the visuals.
Depending on the complexity, a single file can be used for everything, but the process can be organised across multiple files as well (if needed). For example:
Extract - contains the script for data extract (eventually with incremental load implemented if the data volumes are big) and stores the data in qvd files
Transform - loads the qvd files from the extraction process (qvd loads are quite fast) and performs the required transformations
Load - load the data model from the transformation file (binary load) and create the visualisations
Another example of multiple files: I had a project which required multiple extractor and multiple transformation files. Because the data was extracted from multiple data sources, to speed up the process we ran all the extractor files at the same time, then all the transformation files at the same time, then the main transform, which combined all the qvd files into a single data model.
In addition to the previous comment have a look at the layered Qlik architecture.
There it is described quite well how you should structure your files.
However, I would not recommend using Qlik for a full-blown data warehouse (which you could do with SSIS easily), as it lacks some useful functions (e.g. helpers for slowly changing dimensions).
I am able to migrate data between two SQL Server tables easily using a SSIS data flow task. Can I use format files to specify the columns to choose from the source and destination? If so, can you give me an example?
In our current system, our source and destination tables are not always the same. We have been using SQL-DMO with format files so far and are now upgrading to SSIS.
Thanks in advance for your suggestions.
You can look up how to create a format file here: http://msdn.microsoft.com/en-us/library/ms191516.aspx
Google SSIS Bulk Insert Task to find more on that.
I would recommend using a data flow if you can, because it can eliminate columns from the source that do not exist in the destination, and it can outperform bulk inserts. It's worth consideration.
Mark
Here is a post where I just finished answering my own question; I thought I would link the two posts together.
SSIS - Export multiple SQL Server tables to multiple text files
I am now working with SSIS at the stage of loading data from sources into our data warehouse staging area. I am not sure whether there are any features for controlling the staging process, e.g. controlling the working tables, writing to logging tables, separating the data into batches, merging the batches together...
Right now we are using our own hand-written stored procedures to control these steps for staging. Can any of you give me suggestions for this?
I typically use RAW files for staging, and many similar tasks. The link below has a nice summary.
http://www.jasonstrate.com/2011/01/31-days-of-ssis-raw-files-are-awesome-131/
This question is going to be a purely organizational question about SSIS project best practice for medium sized imports.
So I have a source database which is continuously being enriched with new data. Then I have a staging database into which I sometimes load the data from the source database, so I can work on a copy of the source database and migrate the current system. I am currently using an SSIS Visual Studio project to import this data.
My issue is that I realised the actual design of my project is not really optimal, and now I would like to move this project to SQL Server so I can schedule the import instead of running the Visual Studio project manually. That means the project needs to be cleaned up and optimized.
So basically, for each table, the process is simple: truncate table, extract from source and load into destination. And I have about 200 tables. Extractions cannot be parallelized as the source database only accepts one connection at a time. So how would you design such a project?
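For what it's worth, when the source and destination schemas do line up, the per-table truncate/extract/load step can be driven from a list of table names. This is a minimal sketch with `sqlite3` standing in for the real connections (the table name is hypothetical):

```python
import sqlite3

def truncate_and_copy(src, dst, tables):
    """For each table: empty the destination, then copy all source rows.

    Tables are processed one at a time, matching a source that only
    accepts a single connection.
    """
    for table in tables:
        dst.execute(f"DELETE FROM {table}")  # stands in for TRUNCATE TABLE
        rows = src.execute(f"SELECT * FROM {table}").fetchall()
        if rows:
            placeholders = ",".join("?" * len(rows[0]))
            dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    dst.commit()

# Demo: one table with stale data in the destination.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "a"), (2, "b")])
dst.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
dst.execute("INSERT INTO customers VALUES (99, 'stale')")

truncate_and_copy(src, dst, ["customers"])
print(dst.execute("SELECT * FROM customers").fetchall())  # [(1, 'a'), (2, 'b')]
```

As the edit below explains, this metadata-driven approach breaks down once the source and destination object names diverge, which is why the project ended up with explicit per-table data flows.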
I read in the Microsoft documentation that they recommend using one data flow per package, but managing 200 different packages seems quite impossible, especially as I will have to chain them for the scheduled import. On the other hand, a single package with 200 data flows seems unmanageable too...
Edit 21/11:
The first approach I wanted to use when starting this project was to extract my tables automatically by iterating over a list of table names. This could have worked out well if my source and destination tables all had the same schema object names, but since the source and destination databases are from different vendors (Btrieve and Oracle), they also have different naming restrictions. For example, Btrieve does not reserve names and allows names longer than 30 characters, which Oracle does not. So that is how I ended up manually creating 200 data flows with semi-automatic column mapping (most of it was automatic).
When generating the CREATE TABLE queries for the destination database, I created a reusable C# library containing the methods to generate the new schema object names, just in case the methodology could be automated. If there were a custom package-generation tool that could use an external .NET library, this might do the trick.
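The kind of name-mapping helper described above could look roughly like this. This is a Python sketch of the idea only (the real library was C#, and the truncate-plus-numeric-suffix collision scheme here is an assumption, not the author's actual rule):

```python
def to_oracle_name(name, max_len=30, seen=None):
    """Shorten a long Btrieve-style name to Oracle's 30-character limit,
    appending a numeric suffix when truncation causes a collision."""
    seen = seen if seen is not None else set()
    candidate = name.upper()[:max_len]
    suffix = 1
    while candidate in seen:
        tail = f"_{suffix}"
        candidate = name.upper()[:max_len - len(tail)] + tail
        suffix += 1
    seen.add(candidate)
    return candidate

# Two source names whose first 30 characters collide after truncation.
seen = set()
a = to_oracle_name("CustomerAccountTransactionHistoryDetail", seen=seen)
b = to_oracle_name("CustomerAccountTransactionHistoryDetails", seen=seen)
print(a)  # CUSTOMERACCOUNTTRANSACTIONHIST
print(b)  # CUSTOMERACCOUNTTRANSACTIONHI_1
```

Keeping the mapping in one shared function means the CREATE TABLE generator and any future package generator always agree on the destination names.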
Have you looked into BIDS Helper's BIML (Business Intelligence Markup Language) as a package generation tool? I've used it to create multiple packages that all follow the same basic truncate-extract-load pattern. If you need slightly more cleverness than what's built into BIML, there's BimlScript, which adds the ability to embed C# code into the processing.
From your problem description, I believe you'd be able to write one BIML file and have that generate two hundred individual packages. You could probably use it to generate one package with two hundred data flow tasks, but I've never tried pushing SSIS that hard.
You can basically create 10 child packages, each having 20 data flow tasks, and a master package which triggers these child packages. Using parent-to-child configuration, create a single XML configuration file. Define precedence constraints in the master package so the child packages execute serially. This way maintainability will be better compared to having 200 packages or a single package with 200 data flow tasks.
Following link may be useful to you.
Single SSIS Package for Staging Process
Hope this helps!