What design to follow to create a data warehouse - sql-server-2008

I am working on a project to create a data warehouse. I have been using a third-party tool to create OLAP cubes, but the problem is that it creates a separate staging area for each OLAP cube, even though most of my cubes share the same source data. The company has decided to build a single data warehouse, and the cubes will then source their data from this warehouse.
I will be extracting data from different sources and storing it in a database (the staging area). I will then transform this data into appropriate dimension and fact tables and store them in a separate database called the data warehouse. Finally, I will create the individual cubes by sourcing data from the data warehouse.
My concern is: can I use different databases for the staging area and the data warehouse even though they will be on the same server?
Also, what about my data marts: do all data marts need to live in the same warehouse, or can I put them in different databases? I would like to understand the logical and physical separation options and best practices here.

It sounds like you've been prescribed something along the lines of the Corporate Information Factory from Bill Inmon's data warehouse approach.
http://www.inmoncif.com/library/cif/
Read EDW as your single data warehouse and Departmental Datamarts as your individual cubes.
You could normalise (and perform much of the transformation) as you load the first data warehouse and use this as your centralised data.
When loading your cubes you can choose from several ways of presenting the data for load: staging to a single new database as you have described, creating views on top of the central data warehouse to read from, or creating a separate staging area for each cube.
Keep in mind that the purpose of separate cubes may be to isolate departments from each other, enabling rapid, focused development for each department, while the purpose of the central data warehouse may be to reconcile disparate data into one agreed dataset before it is used for many reporting purposes.
Speak with the people who decided on your architecture to see what they had in mind or the problems they wanted to solve.

For designing a data warehouse:
First, we should establish the purpose of the data warehouse (i.e. what kinds of reports we are going to take from the system).
We need to choose a schema (star or snowflake).
We need to create the dimension tables of the data warehouse.
We need to create the fact table(s), where all transactional data will be stored.
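As a rough illustration of those steps, here is a minimal star schema sketched with SQLite so it can be run anywhere; all table and column names are invented for this example, not taken from any particular warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables describe the "who/what/when" of each transaction.
cur.executescript("""
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20240115
    full_date    TEXT,
    month        INTEGER,
    year         INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- The fact table stores the transactional measures, keyed to the dimensions.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

cur.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 1, 2024)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Toys')")
cur.execute("INSERT INTO fact_sales VALUES (20240115, 1, 3, 29.97)")

# A typical report: revenue per year, grouped through the date dimension.
cur.execute("""
    SELECT d.year, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
""")
print(cur.fetchall())
```

A snowflake schema would go one step further and normalize the dimensions themselves, e.g. splitting `category` out of `dim_product` into its own table.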


Storing Visualizations and Analysis in Database

I am currently working on a web application that will allow users to analyze and visualize data. For example, one of the use cases is that the user performs a Principal Component Analysis and stores it. There can be other such analyses, like a volcano plot, heatmap, etc.
I would like to store these analyses and visualizations in a database on the back end. The challenge I am facing is how to design a relational database schema that will do this efficiently. Here are some of my concerns:
The data associated with the project will already be stored in a normalized manner so that it can be recalled. I would not like to store it again with the visualization.
At the same time, the user should be able to see the original data behind a visualization, e.g. what data was fed to the PCA algorithm. The user might not use all of the data associated with the project for the PCA; he or she could be running it on just a subset of the project's data.
The number of visualizations associated with the web app will grow with time. If I need to design an involved schema every time a new visualization is added, it could slow down overall development.
With these in mind, I am wondering if I should try to solve this with a relational database like MySQL at all, or whether I should look at MongoDB. More generally, how do I think about this problem? I tried looking for blogs/tutorials online but couldn't find much that was useful.
The first step, before thinking about technical design (including whether to use a relational or NoSQL platform), is a data model that clearly describes the structure of and relations between your data in a platform-independent way. I see the following interesting points to solve there:
How is a visualisation related to the data objects it visualizes? When the visualisation just displays the data of one object type (let's say the number of sales per month), this is trivial. But if it covers more than one object type (the number of sales per month, product category, and country), you will have to decide to which of them to link it. There is no single correct solution for this, but it depends on the requirements from the users' view: From which origins will they come to find this visualisation? If they always come from the same origin (let's say the country), it will be enough to link the visuals to that object type.
How will you handle insertions, deletions, and updates of the basic data after the point in time the visualisation was generated? If no such operations relevant to the visuals are possible, then it's easy: just store the selection criteria (country = "Austria", product category = "Toys") with the visual, and everyone will know its meaning. If, however, the basic data can change, you should implement a data model that historizes that data, i.e. one that can reconstruct the data values on which the original visual was based. Of course, before deciding on this, you need to clarify the requirements: in case of changed basic data, will the original visual still be of interest, or will it need to be re-generated to reflect the changes?
Neither question is simplified or complicated by using a NoSQL database.
No matter what the outcome of those requirements and data modeling efforts are, I would stick to the following principles:
Separate the visuals from the basic data, even if a visual is closely related to just one set of basic data. Reason: The visuals are just a consequence of the basic data that can be re-calculated in case they get lost. So the requirements e.g. for data backup will be more strict for the basic data than for the visuals.
Don't store basic data redundantly just to show the basis for each single visual. A timestamp on each record of basic data, together with the timestamp of the generated visual, will serve the same purpose with less effort and storage volume.
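A minimal sketch of these two principles, using SQLite and invented table names: the visual stores only its type, its selection criteria, and a timestamp, while the basic data carries validity timestamps so the state at generation time can be reconstructed without ever copying rows into the visual:

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Basic data lives in its own table, each record versioned with timestamps
# so the state at any point in time can be reconstructed.
cur.executescript("""
CREATE TABLE measurement (
    id          INTEGER PRIMARY KEY,
    country     TEXT,
    category    TEXT,
    value       REAL,
    valid_from  TEXT,          -- when this version of the record became current
    valid_to    TEXT           -- NULL while the version is still current
);
-- A visual stores only metadata; the underlying rows are never duplicated.
CREATE TABLE visual (
    id           INTEGER PRIMARY KEY,
    visual_type  TEXT,          -- e.g. 'pca', 'heatmap'
    criteria     TEXT,          -- JSON selection, e.g. {"country": "Austria"}
    generated_at TEXT
);
""")

cur.execute("INSERT INTO measurement VALUES (1, 'Austria', 'Toys', 42.0, '2024-01-01', NULL)")
cur.execute("INSERT INTO visual VALUES (1, 'pca', ?, '2024-02-01')",
            (json.dumps({"country": "Austria", "category": "Toys"}),))

# Reconstructing the basis of visual 1: apply its stored criteria to the
# versions of the basic data that were current at generation time.
criteria = json.loads(cur.execute("SELECT criteria FROM visual WHERE id = 1").fetchone()[0])
rows = cur.execute("""
    SELECT value FROM measurement
    WHERE country = ? AND category = ?
      AND valid_from <= '2024-02-01' AND (valid_to IS NULL OR valid_to > '2024-02-01')
""", (criteria["country"], criteria["category"])).fetchall()
print(rows)  # [(42.0,)]
```

Adding a new visualization type then only means a new `visual_type` value, not a new schema.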

BI architecture advice

I would like to get some advice on our BI architecture, which is pretty complex to maintain.
I work at an e-shopping company; our production system runs on a LAMP stack (Linux, Apache, MySQL, PHP).
Every night:
data from our production DB (MySQL) is extracted with Talend and inserted into another MySQL database, named DWH, which serves as the data warehouse
data from this DWH is then extracted by Jedox/Palo to load OLAP cubes, which are consumed from Excel with a plugin for reporting
data from this DWH is also accessed by one-off Access/Excel reports, but this is not working very well
Each time we need to modify an existing workflow or create a new report, there are many steps and different technologies involved, which makes the platform quite complicated to maintain.
What can I do to simplify this process?
You should be able to load the Palo OLAP tables with Talend as part of the data warehouse load, using the Palo components provided; that should cut down on doing ETL separately from your main ETL process.
Roberto Machetto's blog has some interesting tutorials on how to do this, and I'm sure there are plenty more examples across the internet; of course, you can ask here about any specific issues you're having.
Once all of your data is properly loaded into the data warehouse and any OLAP cubes, your users should be able to run bespoke or standard queries against the data as it's stored. If you're seeing instances where users don't have access to the proper data for their analysis, that should be resolved in the initial data warehouse/OLAP cube load through a proper understanding of dimensional modelling.
It's a little difficult to give advice about what to do or not to do; it depends on your final target or objective. What I can recommend is to separate your data into stages before delivering your OLAP cubes.
For example, you could create fact and dimension tables in the DW database you already have, separating your data by subject. That would make report building much easier, since anyone could group the data as needed. You could have one fact table just for sales, another for churn, another for new customers, and so on.
Take a look at fact and dimension tables, or dimensional modeling in general; it will make your daily work a lot easier.
Some links:
http://en.wikipedia.org/wiki/Dimensional_modeling
http://www.learndatamodeling.com/ddm.php
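To illustrate the subject-area idea, here is a minimal sketch (SQLite, with invented table names): two subject fact tables, sales and churn, share one customer dimension, so reports can combine the subjects without any extra plumbing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One fact table per subject, both sharing the same customer dimension.
cur.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE fact_sales  (customer_key INTEGER, amount REAL);
CREATE TABLE fact_churn  (customer_key INTEGER, churn_date TEXT);
""")
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?)",
                [(1, 'Alice', 'retail'), (2, 'Bob', 'retail')])
cur.executemany("INSERT INTO fact_sales VALUES (?,?)", [(1, 100.0), (2, 50.0)])
cur.execute("INSERT INTO fact_churn VALUES (2, '2024-03-01')")

# Because both facts share the dimension's key, subjects combine freely:
# e.g. total sales from customers who later churned.
total = cur.execute("""
    SELECT SUM(s.amount) FROM fact_sales s
    JOIN fact_churn c ON c.customer_key = s.customer_key
""").fetchone()[0]
print(total)  # 50.0
```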

Why do we need SSIS and star schema of Data Warehouse?

If SSAS in MOLAP mode stores the data, what is the purpose of SSIS, and why do we need a data warehouse and the SSIS ETL process?
I have a SQL Server OLTP database. I am using SSIS to transfer my SQL Server data from OLTP database to a Data Warehouse database that contains fact and dimension tables.
After that I want to create cubes using SSAS from the data warehouse data.
I know that MOLAP stores data. Do I still need a data warehouse with fact and dimension tables?
Isn't it better to avoid creating a data warehouse and to build the cubes directly from the OLTP database?
This might be a candidate for "Too Broad" but I'll give it a go.
Why would I want to store my data 3 times?
I have my data in my OLTP (online transaction processing) system; why would I want to move that data into a completely new structure (a data warehouse) and then move it again into an OLAP system?
Let's start simple. You have only one system of record and it's not amazingly busy. Maybe you can get away with an abstraction layer (views in the database or named queries in SSAS) and skip the data warehouse.
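As a rough sketch of that abstraction-layer idea (SQLite here, all names invented): a view can present the normalized OLTP tables in the flattened shape a cube or named query would consume, with no physical warehouse at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP-style normalized tables.
cur.executescript("""
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_lines (order_id INTEGER, product TEXT, amount REAL);
-- The abstraction layer: a view that flattens the OLTP tables into the
-- denormalized shape a cube (or SSAS named query) would read.
CREATE VIEW v_fact_sales AS
SELECT o.order_date, l.product, l.amount
FROM orders o JOIN order_lines l ON l.order_id = o.order_id;
""")
cur.execute("INSERT INTO orders VALUES (1, 10, '2024-01-15')")
cur.execute("INSERT INTO order_lines VALUES (1, 'Widget', 9.99)")

print(cur.execute("SELECT * FROM v_fact_sales").fetchall())
# [('2024-01-15', 'Widget', 9.99)]
```

The scenarios below are what eventually break this shortcut.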
So, you build out your cubes, dimensions and people start using it and they love it.
"You know what'd be great? If we could correlate our Blats to the Foos and Bars we already have in there" Now you need to integrate your simple app with data from a completely unrelated app. Customer id 10 in your app is customer id {ECA67697-1200-49E2-BF00-7A13A549F57D} in the CRM app. Now what? You're going to need to present a single view of the Customer to your users or they will not use the tool.
Maybe you rule with an iron fist and say No, you can't have that data in the cube and your users go along with it.
"Do people's buying habits change after having a child?" We can't answer that because our application only stores the current version of a customer. Once they have a child, they've always had a child so you can't cleanly identify patterns before or after an event.
"What were our sales like last year" We can't answer that because we only keep a rolling 12 weeks of data in the app to make it manageable.
"The data in the cubes is stale, can you refresh it?" Egads, it's the middle of the day. The SSAS processing takes table locks and would essentially bring our app down until it's done processing.
Need I go on with these scenarios?
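The customer-id and buying-habits scenarios above are what a conformed, history-keeping (type-2) dimension addresses. A minimal sketch (SQLite, invented names): one warehouse surrogate key per customer version, mapping the keys of each source system and carrying validity dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A conformed customer dimension: it maps the keys used by each source
# system and keeps validity dates so history survives.
cur.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- warehouse surrogate key
    app_id        INTEGER,              -- key in the e-commerce app
    crm_id        TEXT,                 -- key in the CRM
    has_child     INTEGER,
    valid_from    TEXT,
    valid_to      TEXT                  -- NULL = current version
);
""")
# Type-2 history: the same customer appears twice, once per life stage.
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?,?,?,?)", [
    (1, 10, 'ECA67697-1200-49E2-BF00-7A13A549F57D', 0, '2020-01-01', '2023-06-01'),
    (2, 10, 'ECA67697-1200-49E2-BF00-7A13A549F57D', 1, '2023-06-01', None),
])

# Facts recorded before June 2023 point at key 1, later facts at key 2,
# so buying habits before and after the event can be compared directly.
rows = cur.execute(
    "SELECT customer_key, has_child FROM dim_customer WHERE app_id = 10 ORDER BY valid_from"
).fetchall()
print(rows)  # [(1, 0), (2, 1)]
```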
Summary
The data warehouse serves as an integration point for diverse systems. It has conformed dimensions (everyone has a common definition of what a thing is). The data in the warehouse may outlive the data in the source systems. The business needs might drive the tracking of data that the source application does not support. The data in the DW supports business activities, while your OLTP system supports itself.
SSIS is just a tool for moving data. There are plenty out there, some better, some worse.
So No, generally speaking, it is not better to avoid creating a DW and build your cubes based on your OLTP database.

What approach should I follow for report SSRS , SSIS ETL and Data warehouse

I'm working on SSRS; actually, I'm new to this. We have an OLTP database in which we have created a stored procedure for each report. These stored procedures are used to create the datasets in the BI solution that run the reports.
Now we have been asked to adopt the SSIS (ETL) process and the data warehouse concept, and all reports will then run through these two approaches.
So my doubts are:
1) As far as I understand SSIS, we have to create a new database and new tables for each report. Through packages (which include the ETL process) we will insert all the data into these tables and finally fetch the report data directly from them.
This approach speeds up data retrieval because the data is already calculated for every report, and we do not need to design a data warehouse.
Am I right?
2) Do we really need to run all reports through the SSIS and data warehouse approach? That is, how can I judge which reports need to go through SSIS and the data warehouse, and which can continue running against the OLTP system?
3) Are there any good articles on SSIS and data warehouse concepts?
4) Do I have to create the SSIS packages before designing the data warehouse?
Thanks
1) I'm not sure you want a table per report. I guess you might end up with this if none of your reports used the same fields. When I hear data warehouse, I think dimensional model/star schema. The benefit of a star schema is that it simplifies the data model and reduces the number of joins you have to go through to get the data you need, optimizing for data retrieval.
2) The answer to this question depends on your goals. Many companies with a data warehouse try to do all non-real-time reporting out of their data warehouse or an ODS to reduce the load on the production OLTP system. If optimized reliability and speed of report delivery is the goal, then test query speeds, data integrity, and accuracy and decide if a data warehouse with ETL provides a better experience (and if that justifies the monitoring and maintenance required for a data warehouse).
3) For data warehouse concepts, try the Kimball Group. For SSIS, start with MSDN and make sure to visit the SSIS Package Essentials page.
4) You should design your data warehouse before you build SSIS packages. You might have to make a few tweaks as you get into the ETL process, but you generally know what you want to end up with (your DW design) and use SSIS to get the data to that desired end state.

Data mart vs cubes

I've gotten confused by the warehousing process... I'm in the process of building a data mart, but the part I don't really understand relates to cubes. I've read some tutorials about SSAS, but I don't see how I can use this data in other applications. What I need is the following:
A warehouse (data mart) that contains all the data needed for analysis (drill-down and aggregated data, like daily revenue and YTD revenue)
A .NET web service that can expose this data so that many different apps can use it
The part I don't understand is cubes. I see that many people use SSAS to build cubes. What are these cubes in SSAS? Are they objects? Are they tables where data is stored? How can my web service access the data in the cubes?
Are there any alternatives to SSAS? Would it be practical to just build the cubes in a data mart and load them during the ETL process?
Cubes are pre-aggregated stores of data, in a format that makes reporting much more efficient than is possible in a relational database. In SSAS you have several choices for how your data is ultimately stored, but generally it is stored in files in the OS file system. Cubes can be queried similarly to SQL (using a specialized query language called MDX) or by several other methods, depending on your version level. You can set up connections to the data for your web service using the appropriate drivers from Microsoft.
I am unsure what you mean by data mart. Are you referring to relational tables in a star schema format? If so, these are generally precursors to the actual cube. You will not get as much benefit from a reporting standpoint from these relational sources as you would from a cube, since a cube stores the aggregates of each node (or tuple) within the dimensional space defined by your star schema tables.
To illustrate: if I have a relational store (even in star schema format) and I want the sales dollars for a particular location on a particular date, I have to run a query against a very large sales fact table and join the location and date dimension tables (which may also be very large). If I want the same data from a cube, I define my cube filters and the query pulls that single tuple from the data, returning it much more quickly.
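A miniature version of that relational query, sketched in SQLite with invented table names, shows the star join the cube avoids; a MOLAP cube would answer the same question by reading one pre-aggregated tuple instead of joining at query time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: sales dollars for one location on one date requires joining
# the fact table to two dimension tables.
cur.executescript("""
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_sales   (location_key INTEGER, date_key INTEGER, dollars REAL);
""")
cur.execute("INSERT INTO dim_location VALUES (1, 'Vienna')")
cur.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15')")
cur.execute("INSERT INTO fact_sales VALUES (1, 20240115, 500.0)")

dollars = cur.execute("""
    SELECT SUM(f.dollars)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    JOIN dim_date d     ON d.date_key = f.date_key
    WHERE l.city = 'Vienna' AND d.full_date = '2024-01-15'
""").fetchone()[0]
print(dollars)  # 500.0
```

On a real fact table with millions of rows, those joins are the expensive part; the cube pre-computes the aggregate for each (location, date) combination.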
There are many alternatives to SSAS, but each would be some form of cube if you are using a data warehouse. If you have a large data set, a properly designed cube will outperform a relational data mart for multidimensional queries.