I would like to get some advice on our BI architecture, which is pretty complex to maintain.
I work at an e-commerce company; our production system runs on a LAMP stack (Linux, Apache, MySQL, PHP).
Every night:
data from our production DB (MySQL) is extracted with Talend and then inserted into another MySQL database named DWH (data warehouse)
data from this DWH is then extracted by Jedox/Palo to load OLAP cubes, which are queried from Excel with a plugin for reporting
data from this DWH is also used by one-shot Access/Excel reports, but this does not work very well
Each time we need to modify an existing workflow or create a new report, there are many steps and different technologies involved, which makes the platform quite complicated to work with.
What can I do to simplify this process?
You should be able to load the Palo OLAP cubes with Talend, as part of the data warehouse load, using the Palo components provided; that should cut down on doing the OLAP loading as a separate process outside your main ETL.
Roberto Machetto's blog has some interesting tutorials on how to do this, and I'm sure there are plenty more examples across the internet; of course, you could also ask here about any specific issues you're having.
Once all of your data is properly loaded into the data warehouse and any OLAP cubes, your users should be able to run any bespoke (or otherwise) queries against the data as it's stored. If you're seeing instances where users don't have access to the data they need for their analysis, that should be resolved in the initial data warehouse/OLAP cube load through a proper understanding of dimensional modelling.
It's a little difficult to give advice about what to do or not to do; it depends on your final target or objective. What I can recommend is to separate your data into stages before delivering your OLAP cubes.
For example, you could create fact and dimension tables in the DW database you already have, so you can separate your data into subject areas, which would make report building much easier, since anyone could group the data as needed. You could have a fact table just for sales, another for churn, another for new customers, and so on...
Take a look at fact and dimension tables, or dimensional modeling in general; it will make your daily work a lot easier.
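As an illustration only (the table and column names below are invented, not taken from your system), a minimal sales star schema in MySQL could look something like this:

```sql
-- Hypothetical star schema for the "sales" subject area (illustrative names only)
CREATE TABLE dim_date (
    date_key   INT PRIMARY KEY,        -- e.g. 20240131
    full_date  DATE NOT NULL,
    year       SMALLINT NOT NULL,
    month      TINYINT NOT NULL,
    day        TINYINT NOT NULL
);

CREATE TABLE dim_customer (
    customer_key INT AUTO_INCREMENT PRIMARY KEY,
    customer_id  INT NOT NULL,         -- business key from the prod DB
    country      VARCHAR(50),
    segment      VARCHAR(50)
);

CREATE TABLE fact_sales (
    date_key     INT NOT NULL,
    customer_key INT NOT NULL,
    order_id     INT NOT NULL,
    quantity     INT NOT NULL,
    revenue      DECIMAL(12,2) NOT NULL,
    FOREIGN KEY (date_key)     REFERENCES dim_date (date_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key)
);
```

Talend would load the dim_ tables first and fact_sales afterwards each night; the Palo cubes, and any one-shot Access/Excel report, would then read from these subject-oriented tables instead of from a raw copy of the prod schema.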
Some links:
http://en.wikipedia.org/wiki/Dimensional_modeling
http://www.learndatamodeling.com/ddm.php
Related
If SSAS in MOLAP mode stores data, what is the application of SSIS and why do we need a Data Warehouse and the ETL process of SSIS?
I have a SQL Server OLTP database. I am using SSIS to transfer my SQL Server data from OLTP database to a Data Warehouse database that contains fact and dimension tables.
After that I want to create cubes with SSAS from the Data Warehouse data.
I know that MOLAP stores data. Do I still need a Data Warehouse with Fact and Dimension tables?
Isn't it better to avoid creating a Data Warehouse and create cubes directly from the OLTP database?
This might be a candidate for "Too Broad" but I'll give it a go.
Why would I want to store my data 3 times?
I have my data in my OLTP (online transaction processing) system, so why would I want to move that data into a completely new structure (data warehouse) and then move it again into an OLAP system?
Let's start simple. You have only one system of record and it's not amazingly busy. Maybe you can get away with an abstraction layer (views in the database or named queries in SSAS) and skip the data warehouse.
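As a rough sketch of that abstraction layer (the orders, order_lines and customers tables here are placeholders, not a real schema), the cube would be built over a view rather than over the raw OLTP tables:

```sql
-- Hypothetical view over the OLTP schema; SSAS (or any client) queries the view,
-- not the base tables, so the cube definition is insulated from the OLTP layout.
CREATE VIEW vw_sales AS
SELECT
    o.order_date                AS order_date,
    c.customer_id               AS customer_id,
    c.customer_name             AS customer_name,
    ol.product_id               AS product_id,
    ol.quantity * ol.unit_price AS line_amount
FROM orders o
JOIN order_lines ol ON ol.order_id = o.order_id
JOIN customers c    ON c.customer_id = o.customer_id;
```

If the OLTP layout changes later, you adjust the view and the cube definition stays untouched.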
So, you build out your cubes and dimensions, people start using them, and they love it.
"You know what'd be great? If we could correlate our Blats to the Foos and Bars we already have in there." Now you need to integrate your simple app with data from a completely unrelated app. Customer id 10 in your app is customer id {ECA67697-1200-49E2-BF00-7A13A549F57D} in the CRM app. Now what? You're going to need to present a single view of the Customer to your users or they will not use the tool.
Maybe you rule with an iron fist and say No, you can't have that data in the cube and your users go along with it.
"Do people's buying habits change after having a child?" We can't answer that because our application only stores the current version of a customer. Once they have a child, they've always had a child so you can't cleanly identify patterns before or after an event.
"What were our sales like last year" We can't answer that because we only keep a rolling 12 weeks of data in the app to make it manageable.
"The data in the cubes is stale, can you refresh it?" Egads, it's the middle of the day. The SSAS processing takes table locks and would essentially bring our app down until it's done processing.
Need I go on with these scenarios?
Summary
The data warehouse serves as an integration point for diverse systems. It has conformed dimensions (everyone shares a common definition of what a thing is). The data in the warehouse may exceed the lifetime of the data in the source systems. The business needs might drive the tracking of data that the source application does not support. The data in the DW supports business activities, while your OLTP system supports itself.
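To make the history-tracking point concrete, here is a rough sketch of a type 2 slowly changing dimension; the table and column names are hypothetical, not taken from the question:

```sql
-- Hypothetical type 2 customer dimension: the warehouse keeps one row per
-- version of the customer, even though the OLTP app only stores the latest.
CREATE TABLE dim_customer (
    customer_key INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key used by facts
    customer_id  INT  NOT NULL,                 -- business key from the app
    has_child    BIT  NOT NULL,
    valid_from   DATE NOT NULL,
    valid_to     DATE NULL,                     -- NULL = current version
    is_current   BIT  NOT NULL
);

-- "Buying habits before vs. after having a child": each sale is keyed to the
-- customer version that was current at the time of the sale.
SELECT d.has_child, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.has_child;
```

Because each fact row points at the customer version in effect when the sale happened, the before/after question becomes a simple GROUP BY, even though the source application only ever stores the latest state.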
SSIS is just a tool for moving data. There are plenty out there, some better, some worse.
So no, generally speaking, it is not better to skip the DW and build your cubes directly on your OLTP database.
I am working on a project using the ZK Framework, Hibernate, Spring and MySQL.
I need to generate some charts from the MySQL database, but after counting the objects I need in order to calculate the values for those charts, I found more than 1400 objects, and the same number of queries and transactions.
So I thought of using stored procedures in MySQL to calculate those values and save them in separate tables (an architecture close to a data warehouse), and then have my web application just read the values from those tables and display them as charts.
In terms of speed and performance, which of those methods is better?
And thank you
No way to tell, really, without many more details. However:
What you want to do is called Denormalisation. This is a recognised technique for speeding up reporting and making it easier. (If it doesn't, your denormalisation has failed!) When it works it has the following advantages:
Reports run faster
Report code is easier to write
On the other hand:
Report data is out of date, containing only data as at the time you last did the calculations
An extreme form of doing this is to take the OLTP database (a standard database) and export it into an Analysis database (aka a Cube or an OLAP database).
One of the problems of Denormalisation is that a) it is usually a significant effort, b) it adds extra code which adds complexity and thus increases support costs, and c) it might not make enough (or any) difference. Because of this, it is usual not to do it until you know you have a problem. This will happen when you have done your reports on the basic database and have found that they either are too difficult to write and/or run too slowly. I would strongly suggest that only when you reach that point do you go for Denormalisation.
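If you do reach that point, your stored-procedure idea is a standard way to implement the denormalisation. A minimal sketch, assuming invented table names (orders, daily_chart_values):

```sql
-- Hypothetical denormalised summary table, rebuilt by a scheduled job.
CREATE TABLE daily_chart_values (
    chart_date   DATE          NOT NULL,
    metric_name  VARCHAR(50)   NOT NULL,
    metric_value DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (chart_date, metric_name)
);

DELIMITER //
CREATE PROCEDURE refresh_daily_chart_values()
BEGIN
    -- Recompute from the source tables; 'orders' is a placeholder name.
    DELETE FROM daily_chart_values;
    INSERT INTO daily_chart_values (chart_date, metric_name, metric_value)
    SELECT DATE(created_at), 'order_count', COUNT(*)
    FROM orders
    GROUP BY DATE(created_at);
END //
DELIMITER ;
```

Schedule the procedure with a MySQL event or cron, and have the ZK/Hibernate layer read daily_chart_values with one cheap query per chart instead of loading 1400 objects. The trade-off is exactly the one listed above: the charts are only as fresh as the last run.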
There can be times when you don't need to do that, but I've only seen 1 such example in over 25 years of development; and that decision was helped by a desire to use an OLAP database by Management for political purposes.
http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis (e.g. Cryptsy and other exchanges), and to show the latest buy/sell prices from the respective orders on these exchanges as a historical time series.
What backend database should I use to store the retrieved data and to render or plot any parameter from it as historical time series data?
I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
I think we need more detail about the requirement.
The question just says it needs to sync time series data. What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB family of NoSQL databases (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide data-center-level replication (XDCR), so you can replicate the data to another CouchDB instance in another data center, or even to CouchDB on mobile devices.
I hope this is useful to you.
Option 2.
The other approach is data integration: you can sync data using an ETL batch job, where a batch worker copies data to the destination periodically. This is the most common way to replicate data to another destination. There are a lot of tools that support ETL, like Pentaho ETL, Spring Integration, and Apache Camel.
If you provide a more detailed scenario, I can help in more detail.
Enjoy
-Terry
I think MongoDB is a good choice. Here is why:
You can easily scale out, and thus store a tremendous amount of data. With an appropriate shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, should that become a concern.
Replica sets offer automatic failover; availability could implicitly be an issue here.
Using the TTL feature, data can be automatically deleted after its TTL expires, effectively creating a round-robin database.
Both the aggregation framework and map/reduce will be helpful.
There are some free classes at MongoDB University which will help you avoid the most common pitfalls.
I've gotten confused with the warehousing process... I'm in the process of building a data mart, but the part I don't really understand relates to cubes. I've read some tutorials about SSAS, but I don't see how I can use this data in other applications. What I need is the following:
A warehouse (data mart) that contains all the data needed for analysis (drill-down and aggregated data, like daily revenue and YTD revenue)
A .NET web service that can read this data so that many different apps can use it
The part I don't understand is cubes. I see that many people use SSAS to build cubes. What are these cubes in SSAS? Are they objects? Are they tables where data is stored? How can my web service access the data in the cubes?
Are there any alternatives to SSAS? Would it be practical to just build cubes in a data mart and load them during ETL process?
Cubes are pre-aggregated stores of data, in a format that makes reporting much more efficient than is possible in a relational database. In SSAS you have several choices for how the data is ultimately stored, but generally it is stored in files in the OS file system. Cubes can be queried similarly to SQL (using a specialized query language called MDX) or by several other methods depending on your version. You can set up connections to the data for your web service using the appropriate drivers from Microsoft.

I am unsure what you mean by data mart. Are you referring to relational tables in a star schema format? If so, these are generally precursors to the actual cube. You will not get as much benefit from a reporting standpoint from these relational sources as you would from a cube, since a cube stores the aggregates of each node (or tuple) within the dimensional space defined by your star schema tables.

To explain this: if I have a relational store (even in star schema format) and I want to get sales dollars for a particular location on a particular date, I have to run a query against a very large sales fact table and join the location and date dimension tables (which may also be very large). If I want the same data from a cube, I define my cube filters and the query pulls that single tuple from the cube and returns it much more quickly.
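For comparison, the relational query described above would look roughly like this (table names and filter values are hypothetical); the equivalent cube request, written in MDX, simply names the coordinates and reads the already-aggregated cell:

```sql
-- Star-join against the relational data mart: it scans and joins large tables
-- even though only a single number comes back.
SELECT SUM(f.sales_dollars) AS sales_dollars
FROM fact_sales   AS f
JOIN dim_location AS l ON l.location_key = f.location_key
JOIN dim_date     AS d ON d.date_key     = f.date_key
WHERE l.location_name = 'Store 42'        -- placeholder filter values
  AND d.full_date     = '2015-06-01';
```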
There are many alternatives to SSAS, but each would be some form of cube if you are using a data warehouse. If you have a large data set, a properly designed cube will outperform a relational data mart for multidimensional queries.
I want to create a 'google analytics' type application for the web - i.e. a web-based tool to do some reporting and graphing for my database. The problem is that the database is HUGE, so I can't do the queries in real time because they will take too long and the tool will be unresponsive.
How can I use a cron job to help me? What is the best way to make my graphs responsive? I think I will need to denormalize some of my database tables, but how do I make these queries faster? What intermediate values can I store in another database table to make it quicker?
Thanks!
Business Intelligence (BI) is a pretty mature discipline - and you'll find answers to your questions in any book on scaling databases for reporting & data warehousing.
A high-level list of tactics would include:
partitioning (because indexes are of little help for most reporting)
summary tables (usually generated by a batch process submitted via cron; see the sketch after this list)
a good optimizer (some databases, like MySQL, don't have one - so they make poor join decisions)
query parallelism (some databases will provide linear speedups just by splitting your query into multiple threads)
a star schema - a good data model is crucial to good performance
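As a sketch of the summary-table tactic (the page_views table and column names are invented), a nightly cron job would run something like this, and the web tool would then chart from the small rollup table only:

```sql
-- Hypothetical nightly rollup, submitted via cron (e.g. mysql my_db < rollup.sql).
CREATE TABLE IF NOT EXISTS page_views_daily (
    view_date  DATE         NOT NULL,
    page       VARCHAR(255) NOT NULL,
    view_count BIGINT       NOT NULL,
    PRIMARY KEY (view_date, page)
);

-- Re-aggregate yesterday's raw rows; 'page_views' is a placeholder table.
REPLACE INTO page_views_daily (view_date, page, view_count)
SELECT DATE(viewed_at), page, COUNT(*)
FROM page_views
WHERE viewed_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND viewed_at <  CURRENT_DATE
GROUP BY DATE(viewed_at), page;
```

A crontab entry along the lines of `0 2 * * * mysql my_db < rollup.sql` (paths and names illustrative) keeps it refreshed each night.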
In general, dynamic reporting beats the pants off static reporting - so if you're after powerful reporting, I'd try to copy the data into an appropriate model, use aggregates, and possibly change databases to get a good optimizer and the appropriate features, rather than running reports in batch.