I have a task that requires me to use at least 5 years of historical data to create a decision-making model. Currently I'm using the MySQL FoodMart database for Mondrian, but it only covers one year.
I tried to look for an alternative database with a wider time range, but with no luck.
Does anyone know of any alternative db that I could use?
There are plenty of datasets out there freely available to use. You can search for AdventureWorks or the standard TPC-DS dataset.
For a quick start, I suggest you buy the Mondrian book and download its VM, which contains everything already installed along with the AdventureWorks dataset and a schema for Mondrian.
I'm designing a data warehouse and I need a tool that can generate the Date Dimension. I'm using MySQL 5.x.
Thanks
You didn't specify if you are using a third-party ETL application or rolling your own.
If you are using a third-party application, there will most likely be a widget or function available to help generate your date dimension. For example, using the Pentaho Data Integration toolkit, see this excellent article from the O'Reilly Databases Blog: http://www.oreillynet.com/databases/blog/2007/09/kettle_tip_using_java_locales.html
If you are rolling your own, it is a pretty simple exercise to generate every date between two given dates. A stored procedure is going to be more performant, but writing the function in the language you are implementing ETL with will be more maintainable. The helpful links posted by @hafichuk are good examples of how to do the generation in stored procedures. Since you are designing the schema, you will have to write your own procedure that conforms to your definition of the date dimension, or at least modify those examples.
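If you do roll your own in MySQL, a minimal sketch of such a procedure might look like the following (the dim_date table layout and column names here are illustrative assumptions, not a prescribed design):

```sql
-- Hypothetical date dimension table; adjust the columns to your own definition.
CREATE TABLE IF NOT EXISTS dim_date (
    date_key     INT PRIMARY KEY,          -- e.g. 20120131
    full_date    DATE NOT NULL,
    `year`       SMALLINT NOT NULL,
    `quarter`    TINYINT NOT NULL,
    `month`      TINYINT NOT NULL,
    day_of_month TINYINT NOT NULL,
    day_of_week  TINYINT NOT NULL,          -- 1 = Sunday ... 7 = Saturday
    is_weekend   TINYINT(1) NOT NULL
);

DELIMITER //
CREATE PROCEDURE populate_dim_date(IN p_start DATE, IN p_end DATE)
BEGIN
    DECLARE d DATE;
    SET d = p_start;
    -- Walk from p_start to p_end one day at a time, inserting each date once.
    WHILE d <= p_end DO
        INSERT IGNORE INTO dim_date
            (date_key, full_date, `year`, `quarter`, `month`,
             day_of_month, day_of_week, is_weekend)
        VALUES
            (DATE_FORMAT(d, '%Y%m%d') + 0, d, YEAR(d), QUARTER(d), MONTH(d),
             DAYOFMONTH(d), DAYOFWEEK(d), DAYOFWEEK(d) IN (1, 7));
        SET d = DATE_ADD(d, INTERVAL 1 DAY);
    END WHILE;
END //
DELIMITER ;

-- Example: generate five years of dates.
CALL populate_dim_date('2008-01-01', '2012-12-31');
```

INSERT IGNORE keeps the procedure re-runnable, which matters for the test and staging scenarios mentioned below.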
Finally, make sure to give yourself flexibility with the solution you choose. Even though the date dimension will only be built once per production instance, the same generation code will need to be used many other times (test runs, demo/staging deployments, integration test suites...), so it needs to be fast enough and/or flexible enough that it is not a bottleneck. Generating your date dimension in the ETL language at the beginning of every integration test, for the whole range of applicable dates, will get old fast.
You should definitely check out this free date dimension feed on the Azure Data Market.
It's a date table feed designed for import into an Excel PowerPivot model, but it can easily be used for other destinations as well (for example, a MySQL table).
The table contains columns particularly suitable for time-intelligence applications, as well as generic English-language columns that help with creating a basic, all-purpose date dimension.
It also covers several other languages (including US English, Hebrew, Danish, German and Bulgarian, with more on the way).
I have so many tables and FK relationships that it is hard to keep track of and visualize them all; they are spread across multiple Excel documents. I already have everything entered in MySQL, but I want to output a data model diagram that links the tables together along with all the FKs.
How can I do this without doing it manually? I am open to 3rd-party tools as long as they are free.
Well, I am using phpMyAdmin on my local server.
phpMyAdmin 3 has a Designer feature that shows you the linkages between the various tables and their columns.
Take a look at the MySQL Workbench.
It's a free tool and offers some nice features like forward and reverse engineering and database synchronisation. It has a few bugs, but it's the best tool for MySQL I know of so far.
Basically, I have many huge delimited files that I know I can import as a table, but I need to map that data to an existing relational multi-table MySQL database. There should not be any conflict with datatypes, but I'm super new to this, so please point out anything I should be watching for. Clearly I'm not going to run this in production until I know it works.
Not 100% sure Stack Overflow is the right place to ask a database question, but I couldn't find any other Stack Exchange site that was a better fit.
I posted this question on Super User looking for a GUI to do this, but I'm up for coding it if that gets the job done. As such there is no target language, just the requirement that the database be MySQL.
Also, I found this Stack Overflow Q/A that deals with MS SQL's SSIS (which I'm not planning on using due to cost, but the content and issues faced appear to be of the same nature):
Loading Multiple Tables using SSIS keeping foreign key relationships
I'd suggest using the ETL (extract, transform, load) tool from the Pentaho Business Intelligence package. It has a bit of a learning curve, but it'll do exactly what you're looking for. Their ETL tool is called Kettle and it's extremely powerful once you get the hang of it.
There are two versions of Pentaho: an enterprise version that has a free trial, and a free community version. The community version is more than capable, but you might give the enterprise version a test ride too.
Here are some links:
Pentaho Community Edition Site
Kettle Site
Pentaho Enterprise Site
Update: Multiple table outputs
One of the key steps in your transformation is going to be a combination lookup/update. This step checks a given table to see if a record from your data stream exists and inserts a new record if it does not. Regardless of whether it's a new or existing record, it appends that record's key field to your data stream. As you keep going, you'll use these keys as foreign keys when you import data into related tables.
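If you want to see roughly what that step does under the hood, here is the same pattern expressed as plain MySQL (the dim_customer, fact_orders and staging_orders names are assumptions purely for illustration):

```sql
-- 1. Insert dimension records from the staged data only if they don't exist yet.
INSERT INTO dim_customer (customer_code, customer_name)
SELECT DISTINCT s.customer_code, s.customer_name
FROM   staging_orders s
LEFT JOIN dim_customer d ON d.customer_code = s.customer_code
WHERE  d.customer_code IS NULL;

-- 2. Look the surrogate key back up and carry it into the related (fact) table
--    as a foreign key.
INSERT INTO fact_orders (customer_key, order_date, amount)
SELECT d.customer_key, s.order_date, s.amount
FROM   staging_orders s
JOIN   dim_customer d ON d.customer_code = s.customer_code;
```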
We currently have an OLTP SQL Server 2005 database for our project. We are planning to build a separate, de-normalized reporting database so that we can take the load off our OLTP DB. I'm not quite sure which is the best approach to sync these databases; we are not looking for a real-time system, though. Is SSIS a good option? I'm completely new to SSIS, so I'm not sure about the feasibility. Kindly provide your inputs.
Everyone has their own opinion of SSIS, but I have used it for years for data marts and in my current environment, which is a full BI installation. I personally love its capabilities for moving data, and it still holds the world record for moving 1.13 terabytes in under 30 minutes.
As for setup, we use log shipping from our transactional DB to populate a second box, then use SSIS to de-normalize and warehouse the data. The community around SSIS is also very large and there are tons of free training materials and helpful resources online.
We build our data warehouse using SSIS, from which we run reports. It's a big learning curve and the errors it throws aren't particularly useful, and it helps to be good at SQL rather than treating it as a 'row by row transfer' - what I mean is you should be writing set-based queries in SQL command tasks rather than using lots of SSIS components and data flow tasks.
Understand that every warehouse is different and you need to decide how to do it best. This link may give you some good ideas.
Here's how we implement ours (we have a Postgres back end and use the PGNP provider; making use of linked servers could make your life easier):
First of all you need a time-stamp column in each table so you can tell when it was last changed.
Then write a query that selects the data that has changed since you last ran the package (using an audit table helps) and get that data into a staging table. We run this as a data flow task because (using Postgres) we don't have any other choice, although you may be able to make use of a normal reference to another database (dbname.schemaname.tablename or something like that) or use a linked server query. Either way the idea is the same: you end up with the data that has changed since your last run.
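As a rough MySQL-flavoured illustration of that extract step (the audit_load, src.jobs and stg_jobs names are assumptions; the original setup uses Postgres/SQL Server, so adapt the syntax accordingly):

```sql
-- Pull only the rows changed since the last successful load into staging.
INSERT INTO stg_jobs (job_id, job_date, status, last_modified)
SELECT j.job_id, j.job_date, j.status, j.last_modified
FROM   src.jobs AS j
WHERE  j.last_modified > (SELECT MAX(loaded_up_to) FROM audit_load);
```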
We then update (based on ID) the data that already exists, then insert the new data (by left-joining the staging table to the warehouse table to find out what doesn't already exist there).
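In MySQL-flavoured SQL, that update-then-insert pass might look something like this (again, table and column names are illustrative assumptions):

```sql
-- Update warehouse rows that already exist, based on the staged changes.
UPDATE dw_jobs AS d
JOIN   stg_jobs AS s ON s.job_id = d.job_id
SET    d.job_date      = s.job_date,
       d.status        = s.status,
       d.last_modified = s.last_modified;

-- Insert the staged rows the warehouse does not have yet
-- (the left join finds the gaps).
INSERT INTO dw_jobs (job_id, job_date, status, last_modified)
SELECT s.job_id, s.job_date, s.status, s.last_modified
FROM   stg_jobs AS s
LEFT JOIN dw_jobs AS d ON d.job_id = s.job_id
WHERE  d.job_id IS NULL;
```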
So now we have one denormalised table that shows, in this case, jobs per day. From this we calculate other tables based on aggregated values from that one.
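Those aggregate tables can then be rebuilt from the denormalised one with plain GROUP BY queries, for example (the dw_jobs_per_day name is assumed):

```sql
-- Recalculate a jobs-per-day aggregate from the denormalised table.
-- Assumes job_date is the primary key of dw_jobs_per_day.
INSERT INTO dw_jobs_per_day (job_date, job_count)
SELECT job_date, COUNT(*) AS job_count
FROM   dw_jobs
GROUP BY job_date
ON DUPLICATE KEY UPDATE job_count = VALUES(job_count);
```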
Hope that helps. Here are some good links that I found useful:
Choosing .Net or SSIS
SSIS Talk
Package Configurations
Improving the Performance of the Data Flow
Transformations
Custom Logging / Good Blog
I have a database (MySQL) and need to store some results from a web service monthly.
The data can have 10 results today but may have 200 next month.
I need to use a BI tool to create charts and whatnot.
Someone proposed serializing the data and saving the blobs in the database. While that solution seems to work, I have a gut feeling that when the time comes to hook it up to the BI tool, all hell will break loose.
Has anyone had this issue before?
Thanks
Edit: adding extra info.
The problem is that we haven't chosen the BI tool yet, but what it needs to do is create charts from the results. Some of the results come from Google Analytics, so we will be charting the number of visitors to a site over the last 6 months, or the number of pages viewed.
The answer is simple: do not store serialized data in a database.
Do some research, atomize your data, and create a proper data structure.
Once you've done it, you will be able to use any BI tool in the world.
That's the purpose of a database and what distinguishes a database from a flat file.
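For example, for the kind of monthly Google Analytics figures described in the question, a minimal normalized structure might look like this (table and column names are just an assumed illustration):

```sql
-- Store each monthly figure as its own row instead of a serialized blob.
CREATE TABLE metric (
    metric_id INT AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL UNIQUE    -- e.g. 'visitors', 'pageviews'
);

CREATE TABLE monthly_metric (
    site_id      INT NOT NULL,
    metric_id    INT NOT NULL,
    month_start  DATE NOT NULL,               -- first day of the month
    metric_value BIGINT NOT NULL,
    PRIMARY KEY (site_id, metric_id, month_start),
    FOREIGN KEY (metric_id) REFERENCES metric (metric_id)
);

-- Charting visitors over the last 6 months then becomes a plain query
-- that any BI tool can run.
SELECT mm.month_start, mm.metric_value
FROM   monthly_metric mm
JOIN   metric m ON m.metric_id = mm.metric_id
WHERE  m.name = 'visitors'
  AND  mm.month_start >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
ORDER BY mm.month_start;
```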