How to convert a dimensional DB model to a data-mining-friendly layout? - mysql

My problem is that I have a dimensional-model database for an NFL league. Players, Teams, and Leagues are the dimension tables, and Match is the fact table that relates them. If I need to query the stats of a player in a particular match, or over a range of matches, it takes a painstaking SQL query with lots of joins to convert the machine-readable, ID-based tables into a human-readable, name-based version. On top of that, analysis of the data is also painful. As a solution, I am considering transforming the DB into an analysis-friendly version: for example, the Player table would hold one player per row together with the related stats, and the same for Teams.
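For example, a query for one player's stats over a range of matches ends up looking something like this (the table and column names here are only illustrative, not my exact schema):

```sql
-- Illustrative only: table and column names are assumptions, not the actual schema.
-- This is the kind of join-heavy query needed to turn ID-based rows into readable output.
SELECT p.player_name,
       t.team_name,
       l.league_name,
       m.match_date,
       m.yards,
       m.touchdowns
FROM   match_fact  m
JOIN   player_dim  p ON p.player_id = m.player_id
JOIN   team_dim    t ON t.team_id   = m.team_id
JOIN   league_dim  l ON l.league_id = m.league_id
WHERE  p.player_name = 'Some Player'
AND    m.match_date BETWEEN '2015-09-01' AND '2015-12-31';
```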
The question is: is there any framework, method, or schema that could guide me in designing an analysis-friendly DB layout? Also, is SQL still favorable here, or would a non-SQL DB be better for this problem?
I know this sounds like a very general question, but I just want to hear some expertise on the topic, so any help or suggestion is very welcome.

I was on a team faced with a similar situation about 13 years ago. We used a tool called "PowerPlay", a Business Intelligence tool from Cognos. This tool was very friendly to the data analysts, with drill-down capabilities and all sorts of name-based searching.
If I recall correctly (it's been a while), the BI tool stored the data in its own format (a data cube), but it had its own tool for automatically discovering the structure of an SQL-based data source. That automatic tool really struggled with the OLTP database, which was SQL (Oracle) and which was a real mess... a terrible relational design.
So what I ended up doing was building a star schema to collect and organize the same data, but more compatible with a multidimensional view of the data. I then built the ETL stuff to load the star from the OLTP database. The BI tool cut through the star schema like a hot knife through butter.
And the analysts didn't have to mess with ID fields at all.
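As a rough illustration (the names here are invented, not from the system I actually worked on), the star can be exposed through views that already carry the human-readable attributes, so surrogate keys never reach the analysts or the BI tool:

```sql
-- A minimal sketch, assuming a star with one fact table and name-bearing dimensions.
-- The view hides the surrogate keys; consumers only ever see readable columns.
CREATE VIEW player_match_stats AS
SELECT dp.player_name,
       dp.position,
       dt.team_name,
       dd.season,
       dd.week,
       f.passing_yards,
       f.rushing_yards,
       f.touchdowns
FROM   fact_player_match f
JOIN   dim_player dp ON dp.player_key = f.player_key
JOIN   dim_team   dt ON dt.team_key   = f.team_key
JOIN   dim_date   dd ON dd.date_key   = f.date_key;
```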
It sounds like your starting place is like the star schema I had to build. So I would suggest that there are BI tools out there that you can lay on top of your star and that will provide precisely the kind of analyst-friendly environment you are looking for. Cognos is only one of many vendors of BI tools.
A few caveats: if you go this way, you have to make an effort to ensure your name fields "make sense" if they are going to provide meaningful guidance to the analysts trying to drill down or search. Original data sources sometimes treat name fields as more or less meaningless stuff where errors don't matter much. The same goes for column names: column names that DBAs like are often gibberish to data analysts. You may also have to flatten any hierarchical groupings in your dimension tables (a small sketch follows below), though you may have already done this; it depends on what your BI tool needs.
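For instance, a league/conference/division/team hierarchy can be flattened into plain columns of a single dimension row rather than kept as separate linked tables (the names below are purely illustrative):

```sql
-- A hedged sketch of a flattened (denormalized) team dimension; column names are assumptions.
-- Each hierarchy level becomes an ordinary column, which is the shape many BI tools expect
-- for building drill-down paths (league -> conference -> division -> team).
CREATE TABLE dim_team (
    team_key        INT PRIMARY KEY,
    team_name       VARCHAR(100),
    division_name   VARCHAR(100),
    conference_name VARCHAR(100),
    league_name     VARCHAR(100)
);
```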
Hope this helps, even if it's a little generic.

Related

Best practice to use several APIs or data sources for one application

I want to build an application that uses data from several endpoints.
Let's say I have:
JSON API for getting cinema data
XML Export for getting data about ???
Another JSON API for something else
A CSV file for some more data ...
In my application I want to bring all this data together and build views for it and so on ...
My idea was to set up a database by creating schemas for all these data sources, so I can write some kind of "import scripts" that I can call whenever I want to get the latest data.
I thought of schemas because I want to be able to easily adapt to a new API with any kind of schema.
Please enlighten me about the possibilities and best practices out there (theory and practice if possible :P)
You are totally right to set up a database. But the real problem is probably not going to be how to store your data; it's going to be how to make it fit together logically and semantically.
I suggest you first take a good look at what your endpoints can provide. Get several samples from every source and analyze them if you can. How will you know which data is new? How can you match it against existing data and against data from other sources? If existing data changes or gets deleted, how will you detect and handle that? What if sources disagree on something? How and when should you run the synchronization? What will you do if one of your sources goes down? Etc.
It is extremely difficult to make data consistent if your data sources are not. As a rule, if the sources are different, they are not consistent. Thus the proverb "garbage in, garbage out". We humans have no problem dealing with small inconsistencies, but algorithms cannot work correctly if there are discrepancies. Even if everything fits together on paper, one usually forgets that data can change over time...
At least that's my experience in such cases.
I'm not sure whether you want to display all the data in the same view in the application, or whether you are going to create different views for each of the sources. If you want to display the data in the same view, like a grid, I would recommend using inheritance or an interface, depending on your data and needs. I would recommend setting this structure up in the database too, using different tables for the different sources and a parent table, related to all of them, that has a type associated with it (see the sketch after the links below).
Here's a good thread with discussion about choosing an interface or inheritance.
Inheritance vs. interface in C#
And here are some examples of representing inheritance in a database.
How can you represent inheritance in a database?
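As a minimal sketch of that parent-table-plus-type idea (all table and column names here are invented for illustration):

```sql
-- Hypothetical names; one row in the parent per imported item, regardless of source.
CREATE TABLE source_item (
    item_id     INT AUTO_INCREMENT PRIMARY KEY,
    source_type VARCHAR(20) NOT NULL,   -- e.g. 'cinema_json', 'xml_export', 'csv_file'
    imported_at DATETIME    NOT NULL,
    external_id VARCHAR(100)            -- the id the item has in its source system, if any
);

-- One child table per source, holding only that source's specific fields.
CREATE TABLE cinema_item (
    item_id    INT PRIMARY KEY,
    film_title VARCHAR(200),
    showtime   DATETIME,
    CONSTRAINT fk_cinema_item FOREIGN KEY (item_id) REFERENCES source_item (item_id)
);
```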

Which database can be used to store processed data from NLP engine

I am looking at taking unstructured data in the form of files, processing it and storing it in a database for retrieval.
The data will be in natural language and the queries to get information will also be in natural language.
Ex: the data could be "Roses are red" and the query could be "What is the color of a rose?"
I have looked at several NLP systems, focusing mostly on open-source information extraction and relation extraction systems, and the following seems apt and easy for a quick start:
https://www.npmjs.com/package/mitie
This can give data in the form of (word, type) pairs. It also gives a relation as a result of running the processing (check the site example).
I want to know if SQL is a good database for saving this information. For retrieving the information, I will need to convert the natural-language query into some kind of (word, meaning) pairs as well, and to use SQL I will have to write a layer that converts natural language to SQL queries.
Please suggest any open-source databases that work well in this situation. I'm open to suggestions for databases that work with other open-source information extraction and relation extraction systems if not MITIE.
SQL won't be an appropriate choice for your problem. You can use NLP or rules to extract relationships and then store them in a triple store or a graph database. There are many good open-source graph databases, like Neo4j and Titan. You can also search for triple stores; I suppose Apache Jena would be a good choice. After storing your data, you can query your graphs using any of the graph query languages, like Gremlin or Cypher (analogous to SQL). Note that the heart of your system would be a knowledge graph.
You may also set up a Lucene/Solr-based search system on your unstructured data, which may help you answer your queries in conjunction with the graph database. All of these (NLP, IR, graph DBs/triple stores, etc.) would coexist to solve your problem.
It would be like an ensemble; no silver bullets :) To start with, though, look at graph DBs or triple stores.

How do I set up the architecture for a "big data" analysis project?

A friend of mine and I are in our senior year and will be starting a senior project soon. We had the idea to do a data analysis and data visualization project. Our project involves reading a CSV file that is updated every 2 minutes, parsing that data, then storing it in a database. Once that data is stored, we want to run some analysis on it and provide an API through which we could access it for visualization. Our end goal is to build an Android app that displays some of the raw data from the CSV and the analysis in a user-friendly format.
I talked to another CS major and he explained that I would need a few different servers to accomplish this: one for storage, another for analysis, and another for some type of queue that would make sure things don't get screwy while we are doing scraping and analysis. The problem is, I don't really know where to start. I've done some work with a SQL database and a PHP front end before, but nothing with multiple servers. I've heard of tools used in big data projects, like Hadoop, but I'm not exactly sure where it fits in. If someone could point me to a resource of some kind, or explain it themselves, how I would start to structure this kind of project, that would be awesome!
Since you don't have much experience with these things you'll probably want to look at projects like Cloudera. Specifically their resources page has a nice set of videos and articles.
Another source of solid information (that I personally use) is browsing a Stack Overflow tag and sorting by votes. Many good questions on a plethora of big data topics already exist.

Using CouchDB and SQL Server side by side

We currently have a nicely relational SQL Server 2008 database that is our master application database. We are looking to replace an existing document storage mechanism, which uses XML data types, with something more schemaless that can handle similar but not identical documents, and we thought CouchDB would be a good fit.
The idea is that the common metadata about the documents could be stored within SQL Server for ease of display/aggregation/reporting, while the actual documents are stored in Couch to handle the subtle differences between them. The goal is to make the most of the two different technologies.
For example, the status, type, related person, and date created would all be common across all documents and stored in SQL, but an email and a letter (obviously with different fields) would be stored in Couch.
Then we can display our document grid for all types of document (thousands of docs), which can be queried through SQL, but the display of a doc gets its data from Couch when the user requests to view it.
Something to bear in mind is that some document types are generated from templates that are also documents themselves (think mail merge/find and replace).
Application layer: ASP.NET 4.5, C#, repository pattern, Windsor for IoC, JavaScript.
So, to the question...
Is this approach a sensible way to make the most of the two differing data storage paradigms?
Are we making our programming lives needlessly complex in the desire to "use the most appropriate technology for the problem"?
Does anyone have any experiences of trying something similar and if so, how did it go?
It's really not uncommon to use two different storage formats for a document: One for searchable aspects and metadata and another for presentation.
Looking at it in a more general way, the approach is somewhat similar to the one we developed at the Royal Danish Library and pushed in the Planets EU project:
http://www.researchgate.net/publication/221176211_Archive_Design_Based_on_Planets_Inspired_Logical_Object_Model
Here's another paper that discusses this approach in a more general way:
"Opening Schrödingers Library"
The goal was archiving. We recognized that, when converting documents for archiving or preservation, no single storage format was superior in all aspects of preserving the attributes, formats, looks, contents, etc. of the original document. The solution: convert to several formats, and use a sophisticated digital object to track the conversions and which aspects of the original were best preserved in which conversion.
So in my opinion the approach is theoretically and practically sound.
Practical issues: you will probably need some sort of digital object that keeps track of the various parts of a document, e.g. whether it occurs in one system only (and, if so, which one) or in both. It seems that you are going to use SQL Server for this aspect, and that sounds sensible.
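A minimal sketch of what that tracking/metadata table might look like on the SQL Server side (all names here are assumptions, not taken from your schema):

```sql
-- Hypothetical metadata table: only the fields common to every document type live here;
-- the variable, type-specific body stays in CouchDB and is reached via couch_doc_id.
CREATE TABLE document_meta (
    document_id    INT IDENTITY(1,1) PRIMARY KEY,
    document_type  VARCHAR(20)  NOT NULL,   -- e.g. 'email', 'letter', 'template'
    status         VARCHAR(20)  NOT NULL,
    related_person VARCHAR(100),
    date_created   DATETIME     NOT NULL,
    couch_doc_id   VARCHAR(64)  NOT NULL    -- _id of the full document in CouchDB
);
```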
We actually did implement the object model we describe in the paper, and last I heard they are still using it.

Date Dimension Generator for MySQL

I'm designing a data warehouse and I need a tool that can generate the Date Dimension. I'm using MySQL 5.x.
Thanks
You didn't specify if you are using a third-party ETL application or rolling your own.
If you are using a third-party application, there will most likely be a widget or function available to help generate your date dimension. For example, using the Pentaho Data Integration toolkit, see this excellent article from the O'Reilly Databases Blog: http://www.oreillynet.com/databases/blog/2007/09/kettle_tip_using_java_locales.html
If you are rolling your own, it is a pretty simple exercise to generate every date between two given dates. A stored procedure is going to be more performant, but writing the function in the language you are implementing your ETL with will be more maintainable. The helpful links posted by @hafichuk are good examples of how to do the generation in stored procedures. Since you are designing the schema, you will have to write your own procedure that conforms to your definition of the date dimension, or at least modify those ones.
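For instance, a bare-bones roll-your-own generator in MySQL might look like the sketch below; the column set is only an assumption here and should be adapted to your own definition of the date dimension:

```sql
-- A minimal, self-contained sketch (MySQL syntax). Adjust columns to your own dimension design.
CREATE TABLE dim_date (
    date_key         INT PRIMARY KEY,   -- e.g. 20240101
    full_date        DATE NOT NULL,
    calendar_year    INT,
    calendar_quarter INT,
    calendar_month   INT,
    day_of_month     INT,
    day_name         VARCHAR(9)
);

DELIMITER //
CREATE PROCEDURE populate_dim_date(IN p_start DATE, IN p_end DATE)
BEGIN
    DECLARE d DATE;
    SET d = p_start;
    WHILE d <= p_end DO
        INSERT INTO dim_date (date_key, full_date, calendar_year, calendar_quarter,
                              calendar_month, day_of_month, day_name)
        VALUES (CAST(DATE_FORMAT(d, '%Y%m%d') AS UNSIGNED),  -- yyyymmdd surrogate key
                d, YEAR(d), QUARTER(d), MONTH(d), DAY(d), DAYNAME(d));
        SET d = DATE_ADD(d, INTERVAL 1 DAY);
    END WHILE;
END //
DELIMITER ;

-- Usage: CALL populate_dim_date('2010-01-01', '2020-12-31');
```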
Finally, make sure to give yourself flexibility with the solution you choose. Even though the date dimension will only be generated once per production build of the "world", there will be a lot of other times when the same generation code will need to run (test runs, demo/staging deployments, integration test suites...). Thus, it needs to be fast enough and/or flexible enough that it is not a bottleneck. Generating your date dimension in the ETL language at the beginning of every integration test, for the whole range of applicable dates, will get old fast.
You should definitely check out this free date dimension feed on the Azure Data Market.
It's a date table feed designed for import into an Excel PowerPivot model, but it can easily be used for other destinations as well (for example, a MySQL table).
The table contains columns particularly suitable for time-oriented business intelligence applications, and includes generic English-language columns that help with creating a basic, all-purpose date dimension.
It also covers several other languages (including US English, Hebrew, Danish, German, and Bulgarian, with more languages on the way).