My universe has gotten bigger; there are now 8 fact tables and around 20 dimensions.
Because I have 8 fact tables, I defined 8 contexts. My assumption is that it is only possible to take objects that belong to a specific context in order to analyse exactly those objects in one report.
Conversely, that should mean that it is not possible to take objects that belong to different fact tables (different contexts) and analyse them in one report (by one report I refer to one table).
Is my assumption correct?
First some terminology; it might sound like nitpicking, but it avoids confusion:
Document — A single Web Intelligence document which can consist of multiple reports and data providers
Data provider — A collection of universe objects (dimensions, measures, details, …). This may result in one or more SQL statements when defined on a relational source.
Report — Represented as a tab (at the bottom) within a Web Intelligence document (comparable to an Excel worksheet within an Excel workbook). A report can contain data from any and all data providers defined in the same document.
You can specify in the parameters of your data foundation whether or not to allow the selection of multiple contexts in the same data provider. If you allow this, selecting from multiple contexts in the same data provider will result in (at least) 1 SQL statement for each of those contexts.
If you do not allow selection of objects from multiple contexts, you will receive an error message stating Incompatible objects when you try to refresh a data provider that violates this.
See also Universe Design Tool User Guide - paragraph 5.4.7 How do Contexts Affect Queries? and specifically paragraph 5.4.7.3 Incompatible queries.
For the parameter to define the context behaviour, see Information Design Tool User Guide, paragraph 10.18 About data foundation properties. The option is called Multiple SQL statements for each context.
To answer briefly, YES it is possible to combine objects from different contexts in a single block (table or graph) and therefore also in the same report or document (refer to Kristof's clear explanation above of the different components of a WebI Document).
So NO, your assumption is not correct -- but it depends on what kind of objects you are combining. In general, your dimensions will be shared across multiple contexts, while your measures will be specific to a single context. When you build a data provider which uses multiple contexts, the common dimensions will be compatible with all of the measures, and all of the above can be displayed in the same block. Dimensions which are not shared are more complicated: a dimension only available in context A cannot be combined with a dimension only available in context B.
When you think about it (and I encourage you to play around with it and see how it works), it makes sense: as long as the dimensions are shared, you are comparing all your measures to the same thing, whether or not the underlying SQL is separated by contexts.
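To make that concrete, here is a rough sketch (not the SQL your universe will actually generate; the table and column names are invented) of what the two statements for, say, a Sales context and a Returns context sharing a customer dimension might look like:

```sql
-- Hypothetical statement generated for the Sales context
SELECT c.customer_name,
       SUM(f.sales_amount) AS sales_amount
FROM   dim_customer c
       JOIN fact_sales f ON f.customer_id = c.customer_id
GROUP  BY c.customer_name;

-- Hypothetical statement generated for the Returns context
SELECT c.customer_name,
       SUM(f.return_amount) AS return_amount
FROM   dim_customer c
       JOIN fact_returns f ON f.customer_id = c.customer_id
GROUP  BY c.customer_name;
```

Web Intelligence runs both statements and synchronizes the result sets on the shared dimension (customer_name), which is why both measures can sit next to each other in one block.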
I am working on a project that has an infinite number of tables. We have to come up with a solution that brings scalability to the platform, and we can't seem to figure out what a really good one would be.
The platform is a job seeker, so it has two clear parts, candidates, and companies.
We've been thinking and have come up with the following possible solutions to restructure the current database, as it is a monster.
2 APIs, 2 Databases: This would take a lot of database migration work, but would define the different parts of the platform very clearly.
2 APIs, 1 Database: Doing this, the database work would be reduced to normalizing what we have now, but we would still have the two parts of the platform logically separated.
1 API, 1 Database: Normalize the database and do everything in the same API, trying to logically separate everything, making it scalable but at the same time accessible from one part to the other.
Right now I am leaning towards the 1 API, 1 Database solution, but we would like to hear from some experienced users before making the final choice.
Thank you!
I was in a situation kind of like yours some years ago. I will try to express my thoughts on how we handled it. All this might sound opinionated, but each and every task is different, and so are the implementations.
The two largest problems I notice:
Having an infinite number of tables is the first sign that your current database schema design is a Big Ball of Mud.
Acknowledging that you have a monster database indicates that you had better start refactoring it into smaller pieces. Yes, I know, it's never easy.
It would add a lot more value to your question if you would show us some of the architectural details/parts of your codebase, so we could give better suited ideas.
Please forgive me for linking Domain Driven Design related information sources. I know that DDD is not about any technological fluff, however the strategy you need to choose is super important and I think it brings value to this post.
Know your problem domain
Before you start taking your database apart, you should clearly understand how your problem domain works. To put it simply: the problem domain is the domain of the business problems you are trying to solve with the strategy you are going to apply.
Pick your strategy
The most important thing here is the business value your strategy brings. The proposed strategy in this case is to make clear distinctions between your database objects.
Be tactical!
We chose the strategy; now we need to define the tactics applied to this refactoring. Our tactics here should be clearly set out, for example:
Separate the related database objects that belong together; this defines explicit boundaries.
Make sure the connections between the regrouped database objects remain intact and are working. I'm talking about cross table/object references here.
Let's get technical - the database
How to break things
I personally would split up your current schema to three individual separate parts:
Candidates
Companies
Common tables
Reasoning
By strategically splitting up these database objects you consciously separate these concerns. This separation gives you something new: a tactical boundary.
Each of your newly separated schemas now has a different context and different boundaries. For example, the Candidates schema has its own bounded context, grouping together related business concepts, rules, etc. The same applies to the Companies schema.
The only difference is the Common tables schema. This could serve as a shared kernel (a bridge, if you like) between your other schemas, containing all the shared tables that every other schema needs to reach.
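As a minimal sketch (schema and table names are invented, and the exact CREATE SCHEMA syntax varies by RDBMS; in MySQL, separate databases play the role of schemas), the split could look like this:

```sql
-- Each bounded context gets its own schema; run each CREATE SCHEMA
-- in its own batch if your RDBMS requires it.
CREATE SCHEMA candidates;
CREATE SCHEMA companies;
CREATE SCHEMA common;

-- Shared reference data lives in the shared kernel
CREATE TABLE common.city (
    city_id INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);

-- Each context owns its tables and only reaches out to the shared kernel
CREATE TABLE candidates.candidate (
    candidate_id INT PRIMARY KEY,
    full_name    VARCHAR(200) NOT NULL,
    city_id      INT NOT NULL,
    FOREIGN KEY (city_id) REFERENCES common.city (city_id)
);

CREATE TABLE companies.company (
    company_id INT PRIMARY KEY,
    name       VARCHAR(200) NOT NULL,
    city_id    INT NOT NULL,
    FOREIGN KEY (city_id) REFERENCES common.city (city_id)
);
```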
Outcome
All that has been said could bring you up to a level where you can:
Backup/restore faster and more conveniently
Scale database instances separately
Easily set/monitor the access of database objects defined per schema
The API
How to glue things
This is the point where it gets really tricky; however, implementing an API is really dependent on your business use case. I personally would design two different public APIs.
Example
For Candidates
For Companies
The same design principles apply here as well. The only difference is that I think there is no added business value in an API for the Common tables. It could be just a simple database schema which both of these main APIs could query or send commands to.
In my humble opinion, separating the databases results in some content management difficulties. Both of these separate parts will contain exactly the same tables, like job positions, cities, business areas, etc. How will you maintain these tables? Will you insert the country "Zimbabwe" into both of them? What if their primary keys are not equal? At some point you will need to use data from these separated databases, and which record of "Zimbabwe" will be used? I'm not talking about performance, but using the same database for these two parts will make life easier for you. Also, we are in the cloud age, and you can scale your single database service/server/droplet as you want. For clarity of modules, you can define naming conventions. For example, if a table is used by both parts, add the prefix "common_"; if a table is only used by candidates, use "candidate_"; etc.
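A minimal sketch of that prefix convention, with invented table names, where the shared reference data exists exactly once:

```sql
-- Shared reference table: "Zimbabwe" is inserted once and referenced everywhere
CREATE TABLE common_country (
    country_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

-- Candidate-side table
CREATE TABLE candidate_profile (
    candidate_id INT PRIMARY KEY,
    country_id   INT NOT NULL,
    FOREIGN KEY (country_id) REFERENCES common_country (country_id)
);

-- Company-side table
CREATE TABLE company_job_position (
    position_id INT PRIMARY KEY,
    country_id  INT NOT NULL,
    FOREIGN KEY (country_id) REFERENCES common_country (country_id)
);
```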
For the API, you can use the same methodology, too: define three different API parts -- Common, Candidates, and Companies. But in this case, you should code a well-tested authentication and authorization layer for your API.
If I were you, I'd choose the 1 API, 1 Database.
If it fails, separating 1 API into 2 APIs or 1 Database into 2 Databases is much easier than merging them (humble opinion...).
I am currently working on a web-application that would allow users to analyze & visualize data. For example, one of the use-cases is that the user will perform a Principal Component Analysis and store it. There can be other such analysis like a volcano plot, heatmap etc.
I would like to store these analysis and visualizations in a database in the back-end. The challenge that I am facing is how to design a relational database schema which will do this efficiently. Here are some of my concerns:
The data associated with the project will already be stored in a normalized manner so that it can be recalled. I would not like to store it again with the visualization.
At the same time, the user should be able to see what the original data behind a visualization is. For example, what data was fed to the PCA algorithm? The user might not use all the data associated with the project for the PCA; he/she could be doing this on just a subset of the data in the project.
The number of visualizations associated with the web app will grow with time. If I need to design an involved schema every time a new visualization is added, it could make overall development slower.
With these in mind, I am wondering if I should try to solve this with a relational database like MySQL at all. Or should I look at MongoDB? More generally, how do I think about this problem? I tried looking for some blogs/tutorials online but couldn't find much that was useful.
The first step, before thinking about technical design (including the choice of a relational or non-SQL platform), is to create a data model that clearly describes the structure of and the relations between your data in a platform-independent way. I see the following interesting points to solve there:
How is a visualisation related to the data objects it visualizes? When the visualisation just displays the data of one object type (let's say the number of sales per month), this is trivial. But if it covers more than one object type (the number of sales per month, product category, and country), you will have to decide to which of them to link it. There is no single correct solution for this, but it depends on the requirements from the users' view: From which origins will they come to find this visualisation? If they always come from the same origin (let's say the country), it will be enough to link the visuals to that object type.
How will you handle insertions, deletes, and updates of the basic data since the point in time the visualisation has been generated? If no such operations relevant to the visuals are possible, then it's easy: Just store the selection criteria (country = "Austria", product category = "Toys") with the visual, and everyone will know its meaning. If, however, the basic data can be changed, you should implement a data model that covers historizing those data, i.e. being able to reconstruct the data values on which the original visual was based. Of course, before deciding on this, you need to clarify the requirements: Will, in case of changed basic data, the original visual still be of interest or will it need to be re-generated to reflect the changes?
Both questions are neither simplified nor complicated by using a NOSQL database.
No matter what the outcome of those requirements and data modeling efforts are, I would stick to the following principles:
Separate the visuals from the basic data, even if a visual is closely related to just one set of basic data. Reason: The visuals are just a consequence of the basic data that can be re-calculated in case they get lost. So the requirements e.g. for data backup will be more strict for the basic data than for the visuals.
Don't store basic data redundantly to show the basis for each single visual. A timestamp logic with each record of basic data, together with the timestamp of the generated visual will serve the same purpose with less effort and storage volume.
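As a rough sketch of those two principles (all names and types are invented, and the serialized selection criteria are just one possible choice), the schema could separate the visuals from timestamped basic data like this:

```sql
-- Visuals store only their definition, selection criteria and generation time
CREATE TABLE visualization (
    visualization_id   INT PRIMARY KEY,
    project_id         INT NOT NULL,
    viz_type           VARCHAR(50) NOT NULL,   -- 'PCA', 'volcano', 'heatmap', ...
    selection_criteria TEXT NOT NULL,           -- e.g. country = 'Austria' AND category = 'Toys'
    generated_at       DATETIME NOT NULL
);

-- Basic data is historized with validity timestamps instead of being copied
CREATE TABLE measurement (
    measurement_id INT PRIMARY KEY,
    project_id     INT NOT NULL,
    value          DOUBLE PRECISION NOT NULL,
    valid_from     DATETIME NOT NULL,
    valid_to       DATETIME NULL                -- NULL = current version
);
```

Reconstructing what a PCA saw is then a matter of applying the stored selection criteria to the rows whose valid_from/valid_to interval contains generated_at.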
For a site that is using a Sandbox mode, such as a Payment site, would a separate database be used, or the same one?
I am examining two schemas for the production and sandbox environment. Here are the two options.
OPTION 1:
Clone database, route requests to the correct database based upon sandbox mode.
OPTION 2:
Single database, 'main tables' have an is_sandbox boolean.
What would be the pros and cons of each method?
In most situations, you'd want to keep two separate databases. There's no good reason to have the two intermingled in the same database, and a lot of very good reasons to keep them separated:
Keeping track of which entities are in which "realm" (production vs. sandbox) is extra work for your code, and you'll likely have to include it in a lot of places.
You'll need that logic in your database schema as well. UNIQUE indexes all have to include the realm, for instance (see the sketch below).
If you forget any of that code, you've got yourself a potential security vulnerability. A malicious user could cause data from one realm to influence the other. Depending on what your application is, this could range anywhere from annoying to terrifying. (If it's a payment application, for instance, the potential consequences are incredibly dire: "pretend" money from the sandbox could be converted into real money!)
Even if your code is all perfect, there'll still be some information unavoidably leaked between the realms. For instance, if your application uses any sequential identifiers (AUTO_INCREMENT in MySQL, for instance), gaps in values seen in the sandbox will correspond with values used in production. Whether this matters is debatable, though.
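To illustrate how the realm leaks into both the schema and the queries, here is a sketch of what Option 2 tends to force on you (table and column names are invented):

```sql
CREATE TABLE account (
    account_id INT PRIMARY KEY,
    is_sandbox BOOLEAN NOT NULL,        -- the "realm" flag on every main table
    email      VARCHAR(255) NOT NULL,
    -- a plain UNIQUE (email) would forbid having both a production and a
    -- sandbox account, so the realm has to be part of the unique index
    UNIQUE (is_sandbox, email)
);

-- ...and every query has to remember the flag, or the realms bleed together
SELECT account_id
FROM   account
WHERE  email = 'user@example.com'
  AND  is_sandbox = FALSE;
```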
Using two separate databases neatly solves all these problems. It also means you can easily clean out the sandbox when needed.
Exception: If your application is an almost entirely public web site (e.g. Stack Overflow or Wikipedia), or involves social aspects that are difficult to replicate in a sandbox (like Facebook), a more integrated sandbox may make more sense.
I need help in designing SSIS packages for our DWH load.
I have two star-schema models with the following details:
1st Model --> 5 dimension tables and 5 fact tables
2nd Model --> 5 dimension tables and 1 fact table
I have five different source systems from which I need to populate data into these tables.
Based on the above requirements I have thought of designing the package like this:
There will be three packages, which will do the following:
First package will extract the data from the source systems to staging tables (SQL Server tables) with all the necessary transformations.
Second package will load the data to all the dimension tables.
Third package will load the data to all the fact tables.
Please let me know if the above design/architecture will work for this DWH load or do I need to do some modification.
This is quite hard to answer because ultimately if it works then it is correct. There are only varying degrees of "correctness" (is that a word?) or answers which are more (or less) elegant depending on your viewpoint.
However, as a general rule and speaking for myself I have always found it more elegant to load the data into a staging area and then distribute Dimensions and then Facts via Procedures. The work is then performed within the context of the target database and not by the package. The package acts to control the flow.
Also, I would avoid splitting the tasks into more than one package unnecessarily. Of course there may be other considerations that may affect this decision. E.g. Multiple data updates arriving from different sources at different times, but even then I would tend to stage and update at once.
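As a sketch of that pattern (T-SQL, since the target is SQL Server; all object names are invented), each dimension or fact gets a small procedure that the package simply calls:

```sql
CREATE PROCEDURE dbo.load_dim_customer
AS
BEGIN
    -- upsert from the staging table into the dimension using the business key
    MERGE dbo.dim_customer AS tgt
    USING staging.customer AS src
        ON tgt.customer_bk = src.customer_bk
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_name = src.customer_name
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (customer_bk, customer_name)
        VALUES (src.customer_bk, src.customer_name);
END;
```

The packages then only orchestrate the flow: a Data Flow Task fills staging, after which Execute SQL Tasks call the dimension procedures and then the fact procedures.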
Consider a hierarchical file system where each folder maintains a version history (i.e. name and additional properties may change). I need to implement this in MySQL 5.1 but a future version may be ported to SQL Server 2012.
I understand there are several options for tree-structures in databases:
Adjacency List
Nested Set (may cause extremely slow insertions)
Nested intervals (complex stuff, requires support for recursion...)
These techniques have been discussed on StackOverflow before. However, my situation adds another dimension to the problem, as I need to maintain a history for each node. The data that needs to be maintained can be seen as a list of properties, e.g. name, date, type...
Some premises
The database is expected to handle 5-10 simultaneous clients.
The tree is expected to grow to 1,000-5,000 parent nodes (with an arbitrary number of leaves).
Nodes may be inserted at any time.
Nodes/leaves may never be updated or deleted. Instead, a version history is maintained.
Reorganization of nodes is not permitted. (Though, if possible, this would be nice to have!)
Multiple clients may simultaneously add/modify tree nodes. Hence, the clients need to continuously re-read the tree structure (no need for real-time updates).
Order of importance: Traceability (crucial), performance, scalability.
Q: What is the preferred technique of choice for the tree structure and its version controlled node data? SQL samples are appreciated, but not mandatory.
Versioning is extremely tricky because you are dealing with time-varying data, and neither of the databases you have suggested (nor any others that I am aware of) has native support for doing this simply.
Please have a read of Developing Time-Oriented Database Applications in SQL; the book may be almost 15 years old but the issues are largely unchanged.
The fact that you mention "Traceability (crucial)" suggests that you are going to want to get this right.
When considering a simple report to show just the hierarchy, you need to consider whether you need to know:
what the tree looks like today, using today's data (yes, obviously)
what the tree looks like now, using last week's data
what the tree looks like a week ago, using today's data
what the tree looks like a week ago, using last week's data
what the tree looks like a week ago, using the week before last's data
The issues you face arise because you are dealing with time-varying data that is updated at a different time from the real-world process it models, which may itself be dealing with temporal data. Anyway, read the book.
If this is a non-issue (i.e. the tree is static), then #didierc is correct in his comment: the nodes of the tree can refer to an external versioning table. However, if you also need to store versioning information about the hierarchy itself, this approach won't work if implemented naively (using whatever model).
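For the static-tree case, a minimal sketch of that idea (a plain adjacency list plus an external versioning table; names and types are invented, MySQL-style syntax):

```sql
-- The tree itself: a simple adjacency list
CREATE TABLE node (
    node_id   INT PRIMARY KEY,
    parent_id INT NULL,
    FOREIGN KEY (parent_id) REFERENCES node (node_id)
);

-- Node properties are never updated in place; each change adds a new version row
CREATE TABLE node_version (
    node_id    INT NOT NULL,
    version_no INT NOT NULL,
    name       VARCHAR(255) NOT NULL,
    node_type  VARCHAR(50)  NOT NULL,
    valid_from DATETIME     NOT NULL,
    PRIMARY KEY (node_id, version_no),
    FOREIGN KEY (node_id) REFERENCES node (node_id)
);
```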
To give a concrete example, consider a simple tree valid at 1/1/13: A->B->C. Suppose this changes on 1/2/13 to A->D->B->C. If you run a query on 1/3/13, referring back to 1/2/13, which tree do you want to retrieve?
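If the hierarchy itself must also be versioned, one naive, sketched option is to move the parent link out of the node table into a dated edge table, continuing the invented names above:

```sql
CREATE TABLE node_edge (
    node_id    INT  NOT NULL,
    parent_id  INT  NULL,
    valid_from DATE NOT NULL,
    valid_to   DATE NULL,            -- NULL = still current
    FOREIGN KEY (node_id)   REFERENCES node (node_id),
    FOREIGN KEY (parent_id) REFERENCES node (node_id)
);

-- Tree as of the second date in the example (assumed here to be 2013-02-01)
SELECT node_id, parent_id
FROM   node_edge
WHERE  valid_from <= '2013-02-01'
  AND (valid_to IS NULL OR valid_to > '2013-02-01');
```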
Good luck