I'm designing a data warehouse and I need a tool that can generate the Date Dimension. I'm using MySQL 5.x.
Thanks
You didn't specify if you are using a third-party ETL application or rolling your own.
If you are using a third-party application, there will most likely be a widget or function available to help generate your date dimension. For example, using the Pentaho Data Integration toolkit, see this excellent article from the O'Reilly Databases Blog: http://www.oreillynet.com/databases/blog/2007/09/kettle_tip_using_java_locales.html
If you are rolling your own, it is a pretty simple exercise to generate every date between two given dates. A stored procedure is going to be more performant, but writing the function in the language you are implementing ETL with will be more maintainable. The helpful links posted by @hafichuk are good examples of how to do the generation in stored procedures. Since you are designing the schema, you will have to write your own procedure that conforms to your definition of the date dimension, or at least adapt those examples.
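If you do go the stored procedure route, here is a minimal sketch for MySQL 5.x. The dim_date table and its columns are hypothetical stand-ins; adjust the column list to match your own definition of the date dimension.

```sql
-- Hypothetical dim_date table; adjust columns to your own design.
CREATE TABLE IF NOT EXISTS dim_date (
  date_key       INT UNSIGNED NOT NULL PRIMARY KEY,  -- e.g. 20240131
  full_date      DATE         NOT NULL,
  year_number    SMALLINT     NOT NULL,
  quarter_number TINYINT      NOT NULL,
  month_number   TINYINT      NOT NULL,
  day_of_month   TINYINT      NOT NULL,
  day_of_week    TINYINT      NOT NULL,              -- 1 = Sunday ... 7 = Saturday
  is_weekend     TINYINT(1)   NOT NULL
);

DROP PROCEDURE IF EXISTS populate_dim_date;
DELIMITER //
CREATE PROCEDURE populate_dim_date(IN start_date DATE, IN end_date DATE)
BEGIN
  DECLARE d DATE;
  SET d = start_date;
  WHILE d <= end_date DO
    INSERT INTO dim_date
      (date_key, full_date, year_number, quarter_number, month_number,
       day_of_month, day_of_week, is_weekend)
    VALUES (
      CAST(DATE_FORMAT(d, '%Y%m%d') AS UNSIGNED),    -- integer surrogate key
      d,
      YEAR(d),
      QUARTER(d),
      MONTH(d),
      DAYOFMONTH(d),
      DAYOFWEEK(d),
      DAYOFWEEK(d) IN (1, 7)                         -- Sunday or Saturday
    );
    SET d = DATE_ADD(d, INTERVAL 1 DAY);
  END WHILE;
END //
DELIMITER ;

-- Example: load ten years of dates
CALL populate_dim_date('2010-01-01', '2019-12-31');
```

A row-by-row loop like this is plenty fast for a one-off load; if you end up rebuilding the dimension constantly (see below), you may want a set-based version instead.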
Finally, make sure to give yourself flexibility with the solution you choose - even though the date dimension will only be built once per production instance of the "world", there will be plenty of other times when the same date dimension generation code will need to run (test runs, demo/staging deployments, integration test suites...). It therefore needs to be fast enough and/or flexible enough that it does not become a bottleneck. Generating your date dimension in the ETL language, and doing so at the beginning of every integration test for the whole range of applicable dates, will get old fast.
You should definitely check out this free date dimension feed on the Azure Data Market.
It's a date table feed designed for import into an Excel PowerPivot model, but it can easily be used for other destinations as well (for example, a MySQL table).
The table contains columns particularly suitable for time intelligence in BI applications, as well as generic English-language columns that help with creating a basic, all-purpose date dimension.
It also includes columns in several other languages (including US English, Hebrew, Danish, German, and Bulgarian), with more languages on the way.
I am looking at taking unstructured data in the form of files, processing it and storing it in a database for retrieval.
The data will be in natural language and the queries to get information will also be in natural language.
Ex: the data could be "Roses are red" and the query could be "What is the color of a rose?"
I have looked at several NLP systems, focusing on open-source information extraction and relation extraction systems, and the following seems apt and easy for a quick start:
https://www.npmjs.com/package/mitie
This can produce data in the form of (word, type) pairs. It also gives a relation as a result of running the processing (check the example on the site).
I want to know if SQL is a good database for saving this information. For retrieving the information, I will need to convert the natural language query into some kind of (word, meaning) pairs as well,
and to use SQL I will have to write a layer that converts natural language into SQL queries.
Please suggest any open-source databases that work well in this situation. I'm open to suggestions for databases that work with other open-source information extraction and relation extraction systems if not MITIE.
SQL won't be an appropriate choice for your problem. You can use NLP or rules to extract relationships and then store those relationships in a triple store or a graph database. There are good open-source graph databases such as Neo4j and Titan. If you look into triple stores, I suppose Apache Jena should be a good choice. After storing your data you can query your graph using any of the graph query languages, such as Gremlin or Cypher (analogous to SQL). Note that the heart of your system would be a knowledge graph.
You could also set up a Lucene/Solr-based search system on your unstructured data, which may help you answer your queries in conjunction with the graph database. All of these (NLP, IR, graph DBs/triple stores, etc.) would coexist to solve your problem.
It would be like an ensemble; no silver bullets :) To start with, though, look at graph databases or triple stores.
My problem is, I have a dimensional-model database for an NFL league. We have Players, Teams, and Leagues as the dimension tables and Match as the fact table that relates them. For instance, if I need to query the stats of a player in a particular match, or across a range of matches, it takes a painstaking SQL query with lots of joins to convert the machine-readable, ID-based tables into a human-readable, name-based form. On top of that, analysis of the data is also very painful. As a solution, I am considering transforming that DB into an analysis-friendly version. For example, the Player table would include one player per row with the related stats, and the same would go for Teams.
The question is: is there any framework, method, or schema that might guide me in designing the analysis-friendly DB layout? Also, is SQL still favorable, or would a non-SQL DB be better for this problem?
I know it sounds like a very general question, but I just want to hear some expertise on the topic. Any help or suggestion is very welcome.
I was on a team faced with a similar situation about 13 years ago. We used a tool called PowerPlay, a business intelligence tool from Cognos. This tool was very friendly to the data analysts, with drill-down capabilities and all sorts of name-based searching.
If I recall correctly (it's been a while), the BI tool stored the data in its own format (a data cube), but it had its own tool for automatically discovering the structure of a SQL-based data source. That automatic tool really struggled with the OLTP database, which was SQL (Oracle) and which was a real mess... a terrible relational design.
So what I ended up doing was building a star schema to collect and organize the same data, but more compatible with a multidimensional view of the data. I then built the ETL stuff to load the star from the OLTP database. The BI tool cut through the star schema like a hot knife through butter.
And the analysts didn't have to mess with ID fields at all.
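To give a rough idea of how that works in SQL terms, a view over the star can resolve the surrogate keys once, so analysts and BI tools only ever see names. This is just a sketch: fact_match, dim_player, dim_team, dim_league, dim_date, and all of their columns are made-up names standing in for whatever is in your own model.

```sql
-- Sketch only: all table and column names are hypothetical.
CREATE VIEW v_match_stats AS
SELECT
    d.full_date        AS match_date,
    l.league_name,
    home.team_name     AS home_team,
    away.team_name     AS away_team,
    p.player_name,
    f.points_scored,
    f.minutes_played
FROM fact_match f
JOIN dim_date   d    ON d.date_key     = f.date_key
JOIN dim_league l    ON l.league_key   = f.league_key
JOIN dim_team   home ON home.team_key  = f.home_team_key
JOIN dim_team   away ON away.team_key  = f.away_team_key
JOIN dim_player p    ON p.player_key   = f.player_key;

-- Analysts then query by names instead of IDs:
SELECT player_name, SUM(points_scored) AS total_points
FROM v_match_stats
WHERE match_date BETWEEN '2015-09-01' AND '2016-02-29'
GROUP BY player_name
ORDER BY total_points DESC;
```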
It sounds like your starting place is like the star schema I had to build. So I would suggest that there are BI tools out there that you can lay on top of your star and that will provide precisely the kind of analyst-friendly environment you are looking for. Cognos is only one of many vendors of BI tools.
A few caveats: If you go this way, you have to make an effort to make sure your name fields "make sense" if they are going to provide meaningful guidance to the analysts trying to drill down or search. Sometimes original data sources treat name fields as more or less meaningless stuff, where errors don't matter much. The same goes for column names. Column names that DBAs like are often gibberish to data analysts. You may also have to flatten any hierarchical groupings in your dimension tables, but you may have already done this. It depends on what your BI tool needs.
Hope this helps, even if it's a little generic.
I have a task that requires me to use historical data covering at least 5 years to create a decision-making model. Currently I'm using the MySQL Foodmart database for Mondrian, but it only covers one year.
I tried to look for an alternative database with a wider time range, but had no luck.
Does anyone know of any alternative db that I could use?
There are plenty of datasets out there freely available to use. You can search for AdventureWorks, or the standard TPC-DS dataset.
For a quick start, I suggest you buy the Mondrian book and download its VM, which has everything already installed, along with the AdventureWorks dataset and a schema for Mondrian.
I now have to deal with a program called FDT. The company I am working for no longer has support for it, but we are still using it. I need to insert new orders into the program from the website; I can get them from Magento in XML, CSV, or some other format, and I am trying to automate this process. All work in the office is done through FDT: checking out-of-stock items, printing bills, and so on.
I am now thinking of using Profiler to trace events. I would like to know what processing the program does when we place an order in it. I am not an experienced user of Profiler; I would like some suggestions on whether it is possible to know what tables the program affects and what columns it updates or writes to.
Above is a new order number the program generates, which is a unique integer ID. I am not able to work out the pattern. I do have a test server where I can make changes, so trial and error is no problem.
Some suggestions on how I should proceed, or at least where to start, would be appreciated.
I think the most important thing would be to trace the T-SQL, but again, which events and what filters should I use?
I am sorry if it is a stupid question; I am trying to learn. Source code and support are not an option.
This question has too many parts: how to run a trace, how to deal with an application post-support-contract, how to reverse engineer an app, and even whether that is a good idea (and sometimes it's the only idea available). I'd re-ask this as a series of narrow technical questions, or ask it on Programmers (after reading their FAQ; they only like certain questions).
Yup, been there, done that. In large organizations these tasks normally fall to techies who don't wield the awesome power of the budget and can't personally go negotiate a new contract with the original vendor. I assume you have food bills to pay and can't tell your supervisor, "well, I ain't doing nothing until we get a support contract."
Step 0 Diagram the tables - work out the entity relationships and assemble a data dictionary (one that explains the motivation of each table and column, not just the name and data type).
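As a hedged starting point for Step 0, a query like the one below against SQL Server's catalog views lists every table and column with its data type and whatever MS_Description extended property happens to be filled in. The "why does this exist" part of the dictionary is still manual work.

```sql
-- Data-dictionary starter for SQL Server 2008: tables, columns, types,
-- and any MS_Description extended properties that already exist.
SELECT
    t.name                          AS table_name,
    c.name                          AS column_name,
    ty.name                         AS data_type,
    c.max_length,
    c.is_nullable,
    CAST(ep.value AS NVARCHAR(MAX)) AS description
FROM sys.tables t
JOIN sys.columns c  ON c.object_id = t.object_id
JOIN sys.types   ty ON ty.user_type_id = c.user_type_id
LEFT JOIN sys.extended_properties ep
       ON ep.major_id = t.object_id
      AND ep.minor_id = c.column_id
      AND ep.name     = 'MS_Description'
ORDER BY t.name, c.column_id;
```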
Step 1 Attach the profiler to an active instance of SQL Server 2008. If you have a specific question about SQL Profiler, open a new question. One hint: if you are attached to a multi-user instance, filter down to just your own user (the one in the connection string).
http://blog.sqlauthority.com/2009/08/03/sql-server-introduction-to-sql-server-2008-profiler-2/
Step 2
Do an action in the application and watch what SQL is emitted. If it is plain SQL, you can copy and paste it into Management Studio so you can diagram the query and run your own test executions. If it is a stored procedure, go read the source code of that procedure. If the stored procedure is encrypted, it may or may not be possible to decrypt it. A scenario where decrypting the code is fairly defensible is when you aren't redistributing it and the supporting company isn't there any more.
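Once a procedure name shows up in the trace, a couple of standard calls tell you whether you can read its source directly. The procedure name below is only a placeholder for whatever you actually see in your trace.

```sql
-- Placeholder procedure name; substitute whatever appears in your trace.
-- Returns 1 if the procedure was created WITH ENCRYPTION, 0 otherwise.
SELECT OBJECTPROPERTY(OBJECT_ID('dbo.usp_InsertOrder'), 'IsEncrypted') AS is_encrypted;

-- If it is not encrypted, either of these will show you the source:
EXEC sp_helptext 'dbo.usp_InsertOrder';
SELECT OBJECT_DEFINITION(OBJECT_ID('dbo.usp_InsertOrder'));
```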
Step 3
Once you understand the app, you can write reports or, more likely, you will want to record new or old transactions differently.
If the app is written in .NET or Java, you can decompile it and read the code. Creating a custom build from that source isn't going to be fun. A more likely outcome is that you will create an application that targets the same tables, or possibly export all the data out of the original app and into a new bespoke one.
We currently have a nicely relational SQL Server 2008 database that is our master application database. We are looking to improve an existing document-storage mechanism, which uses XML data types, with something more schemaless that can handle similar but not identical documents, and we thought CouchDB would be a good fit.
The idea is that the common metadata about the documents would be stored in SQL Server for ease of display/aggregation/reporting, but the actual documents would be stored in Couch to handle the subtle differences between them. The aim is to make the most of the two different technologies.
For example, the status, type, related person, and date created would be common across all documents and stored in SQL, but an email and a letter (obviously with different fields) would be stored in Couch.
Then we can display our document grid for all types of document (thousands of docs), which can be queried through SQL, but the display of a doc gets its data from Couch when the user requests to view it.
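To make the split concrete, the SQL side could be little more than an index table carrying the shared metadata plus the CouchDB document id. This is only a sketch with illustrative names, not a prescribed schema; the exact columns depend on your application.

```sql
-- Illustrative only: the CouchDocId column holds the _id of the full
-- document in CouchDB, so the grid is served entirely from SQL Server
-- and the body is fetched from Couch only when the user opens the doc.
CREATE TABLE dbo.DocumentIndex
(
    DocumentId      INT IDENTITY(1,1) PRIMARY KEY,
    DocumentType    NVARCHAR(50)  NOT NULL,   -- e.g. 'Email', 'Letter'
    Status          NVARCHAR(50)  NOT NULL,
    RelatedPersonId INT           NOT NULL,
    CreatedDate     DATETIME      NOT NULL DEFAULT GETDATE(),
    CouchDocId      NVARCHAR(100) NOT NULL,   -- CouchDB _id of the full document
    CouchRevision   NVARCHAR(100) NULL        -- optional: last known _rev
);
```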
Something to bear in mind is that some document types are generated from templates that are also documents themselves (think mail merge/find and replace).
Application layer is ASP.NET 4.5, C#, repository pattern, Windsor for IoC, JavaScript.
So, to the question...
Is this approach a sensible way to make the most of the two differing data storage paradigms?
Are we making our programming lives needlessly complex in the desire to "use the most appropriate technology for the problem"?
Does anyone have any experiences of trying something similar and if so, how did it go?
It's really not uncommon to use two different storage formats for a document: One for searchable aspects and metadata and another for presentation.
Looking at it in a more general way, the approach is somewhat similar to the one we developed at the Royal Danish Library and pushed in the Planets EU project:
http://www.researchgate.net/publication/221176211_Archive_Design_Based_on_Planets_Inspired_Logical_Object_Model
Here's another paper that discusses this approach in a more general way:
"Opening Schrödingers Library"
The goal was archiving. We recognized that when converting documents for archiving or preservation no sigle storage format was superior in all aspects of preserving the attributes, formats, looks, contents etc of the original document. Solution: Convert to several formats, and use a sophisticated digital object to track the conversions, and which aspects of the original were best preserved in which conversion.
So in my opinion the approach is theoretically and practically sound.
Practical issues: you will probably need some sort of digital object that keeps track of the various parts of a document, e.g. whether it occurs in one system only (and if so, which one) or in both. It seems that you are going to use SQL Server for this aspect, and that sounds sensible.
We actually did implement the object model we describe in the paper, and last I heard they are still using it.