I recently ran into a problem in our SQL Server 2008 Analysis Services cube. Imagine you have a simple sales data warehouse with orders and products. Each order can be associated with several products, and each product can be contained in several orders. So the data warehouse consists of at least three tables: one for the Products, one for the Orders, and one reference table modelling the n:n relationship between the two.
The question I want our cube to answer is: How many orders are there which contain both product x and product y?
In SQL, this is easy:
select orderid from dbo.OrderRefProduct
where ProductID = 1
intersect
select orderid from dbo.OrderRefProduct
where ProductID = 3
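For anyone who wants to sanity-check the set logic outside SQL Server, here is a minimal sketch using Python's sqlite3 in-memory database; the table mirrors `dbo.OrderRefProduct` from above, and the sample rows are hypothetical:

```python
# Minimal sanity check of the INTERSECT logic using SQLite; the table
# name mirrors dbo.OrderRefProduct, the rows are made-up test data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE OrderRefProduct (orderid INTEGER, ProductID INTEGER);
INSERT INTO OrderRefProduct VALUES
  (10, 1), (10, 3),   -- order 10 contains both product 1 and product 3
  (11, 1),            -- order 11 contains only product 1
  (12, 3), (12, 5);   -- order 12 contains products 3 and 5
""")

# "Which orders contain both product 1 and product 3?"
rows = con.execute("""
SELECT orderid FROM OrderRefProduct WHERE ProductID = 1
INTERSECT
SELECT orderid FROM OrderRefProduct WHERE ProductID = 3
""").fetchall()
print(rows)  # only order 10 qualifies: [(10,)]
```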
I am fairly proficient in SQL but a newbie in MDX, and I have been unable to implement that in MDX. I have tried using distinct count measures, the MDX functions Intersect and NonEmpty, and subcubes. I also tried duplicating the dimension logically (by adding it to the cube twice) as well as physically (by duplicating the data source table and the dimension).
On http://www.zeitz.net/thts/intersection.zip, you can download a 25 kB zip file containing an SQL script with some test data and the Analysis Services solution that uses those tables.
We are using SQL Server 2008 R2 and its Analysis Services counterpart. Performance considerations are not that important, as the data volume is rather low (millions of rows) compared to the other measure groups included in that cube (billions of rows).
The ultimate goal would be to be able to use the desired functionality in standard OLAP (custom calculated measures are ok), since Excel is our primary frontend, and our customers would like to choose their Products from the dimension list and get the correct result in the cube measures. But even a working standalone MDX-Query would greatly help.
Thank you!
Edit March 12th
Did I miss something or can't this be solved somehow?
If it helps to build the MDX, here is another way to get the results in SQL, using subqueries. It can be nested further.
select distinct b.orderid from
(
    select distinct orderid from dbo.OrderRefProduct
    where ProductID = 1
) a
join dbo.OrderRefProduct b on (a.orderid = b.orderid)
where b.ProductID = 3
I tried something like this with subcubes in MDX, but didn't succeed.
I've had a go - you can download my solution from here:
http://sdrv.ms/YWtMod
I've added a copy of your Fact table as a "Cross Reference", Aliased the Product1 dimension as a "Cross Reference", set the Dimension references to Product independently from your existing relationships, and specified the Many-to-Many relationships.
It is returning the right answer in Excel (sample attached).
You could extend that pattern as many times as you need.
Good luck!
Mike
Another way to deal with this in SQL (I know the approach works, but I haven't tested this exact query) is to use double negation:
select distinct x.orderid
from X x
where not exists (
    select *
    from (values (id1), (id2)) as required(productid)
    where not exists (
        select *
        from X x_alias
        where x_alias.orderid = x.orderid
          and x_alias.productid = required.productid
    )
)
I'm pretty sure you can do the same in MDX.
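That double-negative (relational division) pattern can be sketched in SQLite as well; this is a hypothetical demo, with the required products moved into a small table and the generic `X`/`TK` names replaced by the question's order/product naming:

```python
# Relational division via double NOT EXISTS: "orders for which no
# required product is missing". Data and names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE OrderRefProduct (orderid INTEGER, ProductID INTEGER);
INSERT INTO OrderRefProduct VALUES (10,1), (10,3), (11,1), (12,3), (12,5);
CREATE TABLE Required (ProductID INTEGER);
INSERT INTO Required VALUES (1), (3);   -- the two products we demand
""")

rows = con.execute("""
SELECT DISTINCT o.orderid
FROM OrderRefProduct AS o
WHERE NOT EXISTS (
    SELECT * FROM Required AS r
    WHERE NOT EXISTS (
        SELECT * FROM OrderRefProduct AS o2
        WHERE o2.orderid = o.orderid
          AND o2.ProductID = r.ProductID))
""").fetchall()
print(rows)  # only order 10 has both required products: [(10,)]
```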
I'm having trouble understanding how this should work... Basically I have 2 main tables: one holds Revenues, the other Costs.
Revenues table has fields as: P&L (string), Category (string), Products (string), Sold (int), invoiced (int), delivered (int), date (date).
Costs table has: P&L (string), Category (string), Products (string), Costs (int), date (date).
I'd like to use the tables together to perform various calcs, like margin, at any level (total margin, meaning total revenues minus total costs, or at Category level, where I should be able to filter on any category I have and perform the calc, and so on).
The problem is that every attempt I've made to use relations or joins resulted in duplications.
The only workaround I've managed so far is to leave the Revenues table as it is and create several Costs tables, basically one per field (table1 with Category, Costs and date; table2 with Products, Costs and date; etc.). Joining Revenues with one of these tables seems to work, but this way I'm not able to create a wider view (one goal is to make a big table in the viz where we can read all the data at once). Another problem I've seen with this workaround: if I want to split costs by date but use the date column from the Revenues table, Tableau doesn't recognize the date correctly even when the dates are identical (I basically copy/pasted between the tables). So to split costs I have to use the Costs table's date column, and to split revenues the Revenues table's date column, which is frankly a pain...
So my question: how could I merge the 2 tables into one, or at any rate put all the data together in one working table to perform any kind of calcs, and how could I use just 1 date column that works for all the dates together?
I've uploaded a file here to help show what I'm trying to combine. Thank you guys
Data file
ps.: it seems Tableau uses SQL behind the scenes for these tasks, so someone skilled in this kind of problem in SQL could probably help as well... that's why I've tagged sql too, thanks
You need to UNION those 2 tables together - but are they really in Google, or did you just do that to demo it here?
If you're using Excel - both Revenue & Cost must be different sheets in the same XLS file
If you're using CSV - both Revenue & Cost must be different files (hopefully in the same folder)
I would really hope that you're using a database (some form of SQL), but with either of the above options, UNION the data and it will work the way you expect :)
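A minimal sketch of that UNION, using SQLite as a stand-in for whatever database sits behind Tableau; the `SalesDate` column name and the sample rows are my assumptions, not from the question. Each side is padded with NULLs for the measures it lacks, giving one row set with a single date column:

```python
# UNION ALL of Revenues and Costs into one combined row set with NULL
# padding; table/column names follow the question, data is hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Revenues ("P&L" TEXT, Category TEXT, Products TEXT,
                       Sold INT, Invoiced INT, Delivered INT, SalesDate TEXT);
CREATE TABLE Costs    ("P&L" TEXT, Category TEXT, Products TEXT,
                       Costs INT, SalesDate TEXT);
INSERT INTO Revenues VALUES ('PL1', 'CatA', 'Prod1', 100, 90, 80, '2023-01-01');
INSERT INTO Costs    VALUES ('PL1', 'CatA', 'Prod1', 60, '2023-01-01');
""")

con.execute("""
CREATE VIEW Combined AS
SELECT "P&L", Category, Products, SalesDate,
       Sold, Invoiced, Delivered, NULL AS Costs
FROM Revenues
UNION ALL
SELECT "P&L", Category, Products, SalesDate,
       NULL, NULL, NULL, Costs
FROM Costs
""")

# Margin at Category level: SUM ignores NULLs, so both sides
# aggregate cleanly in one query over the single combined table.
margin = con.execute(
    "SELECT SUM(Invoiced) - SUM(Costs) FROM Combined WHERE Category = 'CatA'"
).fetchone()[0]
print(margin)  # 90 - 60 = 30
```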
I am using an Informatica Intelligent Cloud Services (IICS) Intelligent Structure model to parse a JSON file that I have. The file is located on an S3 bucket and contains 3 groups. Two groups contain lots of records (~100,000 each) and the 3rd group contains ~10,000 records. According to the Intelligent Structure model, the largest group contains the PK, which I can use to join the other groups. But the issue is: for Master and Detail, which group should I select? Usually the group with fewer records should be the master, but in my case the group with fewer records contains the foreign key. Is there a workaround for this issue?
I am new to IICS, so how do I resolve this?
Any help will be appreciated. Thanks in advance!
The rule is: the table with the smaller row count should be the master, because during execution the master source is cached in memory for the join.
Having said that, you can use the 3rd group, with fewer rows, as the master for both joins, like below. For a normal join the logic remains the same, but performance will improve if you choose a master with fewer rows and lower granularity.
Sq_gr1(d)\
Sq_gr3-jnr1(m)->|jnr2----->
Sq_gr2(d)------>/
An outer join will take time roughly proportional to the row count.
I am currently studying databases, and I have a question.
The professor told us to create 2 different databases, and then move all the data to a star schema data model.
Here is the diagram for the first database, I have already filled the tables with data.
This is the diagram for the second database, also with data
This is my star schema model
The problem I am facing is that I do not know how to start doing the mapping when adding my origin OLE DB and destination OLE DB components.
I have already searched through the web, but I only find examples where they have to move just one database to the star schema data model.
The task you have is to consolidate two transactional/OLTP systems into a data warehouse. The reason you find only examples of moving/mapping one system into a data warehouse is that you simply repeat the process for each additional source that feeds into the DW.
In your example, you are integrating two sales systems to produce a unified sales business process report.
My approach would be to copy all the tables as-is into your DW and put them into staging schemas (stage_d1, stage_d2). This way you have all the data locally, and it's consistent as of the start of your data extract, i.e. as of 9:30 this morning.
Now that you have the data locally, you need to transform and enrich the data to populate your dimensions and then your fact table.
Let's analyze Dim_Customer. This table is probably a little light in terms of providing value but the methodology is what you should focus on. System 1 supplies a first and last name and a city, state and zipcode. System 2 gives us a Company name, contact name, city, state, postal code and phone.
Given the use of both postal code and zip code, I'd wonder whether we're dealing with international addresses versus US-centric (zip code) data. I'd also notice that we don't have an actual address line. The point is: analyse your data so you know you're modeling something that solves the problem (reporting on sales across all systems).
The next question I'd wonder about is how we populate the Customer dimension. If a Mario Almaguer had a purchase in both system 1 and system 2, are they the "same" person? Does it matter for this business process? If we sold to a person in TX and that name also exists in ME, does it matter if the name is in there twice?
I'll assume we only care about unique customer names. If it's a bad assumption, we go back and model it differently.
In my source, I'll write a query.
SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName FROM stage_d1.Customer AS C;
Run that, see that it returns the data I want. I'll then use an OLE DB Source in an SSIS data flow and use the third drop down option of Source query.
If I run the package, we'll get all the unique customer names from the source system. But we only want the names we don't have so that means we need to use something to check our reference table for existing matches. That's the Lookup Component
The source for the Lookup will be the DW's Dim_Customer table. You'll match based on CustomerName. The lookup component will tell us whether an incoming row matched and we can get two output streams: Match and no-match. We're only interested in the no-match path because that's new data. Andy Leonard has an excellent Stairway to Integration Services and in particular, we're talking about an Incremental Load.
From the Lookup, we'll drag the no-match branch to an OLE DB Destination where we point at Dim_Customer table.
You run that and Dim_Customer is populated. Run it again, and no new rows should be added as we're looking for new data only.
Now we need to solve getting the second staged customer data integrated. Fortunately, it's the same steps except this time our query is easier.
SELECT DISTINCT C.ContactName AS CustomerName FROM stage_d2.Customers AS C;
Lather, rinse repeat for all of your other dimensions.
You could also skip the data flows and simply execute a query to do the same.
INSERT INTO dbo.Dim_Customer(CustomerName)
SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
FROM stage_d1.Customer AS C
WHERE NOT EXISTS (SELECT * FROM dbo.Dim_Customer AS DC WHERE DC.CustomerName = CONCAT(C.FirstName, ' ', C.LastName));
Lather, rinse, repeat for the remaining dimensions.
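The NOT EXISTS pattern above is what makes the load incremental: run it twice and the second run adds nothing. A small SQLite sketch (staging table and names simplified from the answer's `stage_d1.Customer`, since SQLite has no schemas; data is hypothetical):

```python
# The NOT EXISTS incremental-load pattern, demonstrated in SQLite.
# Running the INSERT twice inserts no duplicate dimension rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stage_Customer (FirstName TEXT, LastName TEXT);
CREATE TABLE Dim_Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
INSERT INTO stage_Customer VALUES ('Mario', 'Almaguer'), ('Ada', 'Lovelace');
""")

load = """
INSERT INTO Dim_Customer (CustomerName)
SELECT DISTINCT C.FirstName || ' ' || C.LastName
FROM stage_Customer AS C
WHERE NOT EXISTS (SELECT * FROM Dim_Customer AS DC
                  WHERE DC.CustomerName = C.FirstName || ' ' || C.LastName)
"""
con.execute(load)
con.execute(load)  # second run is a no-op: only new names are inserted

count = con.execute("SELECT COUNT(*) FROM Dim_Customer").fetchone()[0]
print(count)  # still 2 after two runs
```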
Loading the fact is similar, except we will use the Lookup components to find matches (we need to translate our data to their dimension IDs). Here I'll show how we'd populate a simplified version of your fact table.
SELECT O.Price AS UnitPrice, BO.OrderDate AS [Date], 1 AS Quantity, 0 AS Discount, CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
FROM stage_d1.Ordering as O
INNER JOIN stage_d1.Book_Order AS BO
ON BO.OrderID = O.OrderID
INNER JOIN stage_d1.Customer AS C
ON C.CustomerID = BO.Cus_CustomerID;
That's my source query. The customer lookup will continue to match on Dim_Customer's CustomerName but this time we'll retrieve the CustomerID from the lookup component.
The destination then uses the UnitPrice, Date (depends on how you do it), Quantity and Discount directly from our source. The rest of the dimension keys, we populate through our lookup.
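The key translation the Lookup component performs can be sketched as a plain SQL join; here's a hypothetical SQLite version (simplified staging and fact tables of my own invention) where the staged row carries a CustomerName that gets swapped for Dim_Customer's surrogate key:

```python
# Dimension-key lookup as a join: replace the natural key (CustomerName)
# with the surrogate key (CustomerID) while loading the fact table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
INSERT INTO Dim_Customer (CustomerName) VALUES ('Mario Almaguer');
CREATE TABLE stage_fact (UnitPrice REAL, OrderDate TEXT, Quantity INTEGER,
                         Discount REAL, CustomerName TEXT);
INSERT INTO stage_fact VALUES (9.99, '2023-01-01', 1, 0, 'Mario Almaguer');
CREATE TABLE Fact_Sales (UnitPrice REAL, OrderDate TEXT, Quantity INTEGER,
                         Discount REAL, CustomerID INTEGER);
""")

con.execute("""
INSERT INTO Fact_Sales
SELECT F.UnitPrice, F.OrderDate, F.Quantity, F.Discount, DC.CustomerID
FROM stage_fact AS F
INNER JOIN Dim_Customer AS DC ON DC.CustomerName = F.CustomerName
""")

row = con.execute("SELECT CustomerID FROM Fact_Sales").fetchone()
print(row)  # the fact row now carries the surrogate key: (1,)
```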
the standard approach would be to do the following:
Copy your source data into staging tables in your target database
Write the queries, necessary to populate the star schema, against the staging tables
You populate all your dimension tables first and then all your fact tables.
So I am working on an Invoicing Report in SSRS 2008. The database contains 4 relevant tables:
- Work Order
- Labor
- Materials
- Services (Subcontractors)
Obviously the work order table contains all the relevant information about the overall work order (we display things such as location, priority, etc). For this invoice report, I need to display the work order details up at the top, then show the labor, materials, and services used on the work order (with totals for each), then show a complete total for the entire cost of the work order.
My issue is this: I can do a dataset that works with Work Order + any ONE of the child tables, however I cannot figure out how to do all 3! I can't simply do a parameter for WONUM with 3 (or 4) tables on it, because this report will have MANY work orders (one per page) on it. When I use a dataset with the Work Order table and one child table, I group by WONUM then do a page break between each instance.
Any ideas for how to handle it? Most answers I came across say make one gigantic "union all" dataset and then group it after that, or use subreports for each child table. However, we will be exporting this report to Excel, and I've been told that subreports do not render properly when exported.
Any and all help is greatly appreciated! Thanks!
EDIT:
Below are the 4 queries I'd LIKE to use:
This retrieves all the work orders that need to be billed:
SELECT wonum, description, location FROM workorder WHERE billable=1 AND status='COMP'
This retrieves the labor for a work order (specified by #wonum)
SELECT regularhrs, laborrate, totalcost FROM labor WHERE refwo=#wonum
This retrieves the materials for a work order (specified by #wonum)
SELECT description, quantity, unitcost, totalcost FROM material WHERE refwo=#wonum
This retrieves the services (subcontractor hours) for a work order (specified by #wonum)
SELECT description, hours, laborrate, totalcost FROM service WHERE refwo=#wonum
So, as I stated in the original post, my first query retrieves all the work orders that need to be billed. Then for each work order (one per page) I need to retrieve the labor, materials, and services, display them in 3 tables below the work order details, and put an overall total cost on the invoice (at the end of each work order, not at the end of all work orders).
I can get a screenshot of my current report if that would help also. Just let me know in comments!
Your query should look something like this:
SELECT WO.wonum, WO.description as WorkorderDescription, WO.location, L.regularhrs, L.laborrate, L.totalcost,
M.description as MaterialDescription, M.quantity, M.unitcost, M.totalcost as MaterialCost,
S.description as ServiceDescription, S.hours, S.laborrate, S.totalcost as ServiceCost
FROM workorder AS WO
INNER JOIN labor AS L on L.refwo = WO.wonum
INNER JOIN material AS M on M.refwo = WO.wonum
INNER JOIN service AS S on S.refwo = WO.wonum
WHERE billable = 1 AND STATUS = 'COMP'
This will gather the information you need into one dataset, but be aware that because labor, material, and service are joined independently, each work order's rows multiply (labor × material × service combinations), and the INNER JOINs drop work orders that have no rows in one of the child tables. You will need to use the grouping features to set up the table in SSRS, with aggregate scopes that account for the duplication (or pre-aggregate each child table first). You may have to do some additional research if you get stuck on getting the table layout right.
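One way to avoid the row fan-out (a suggestion of mine, not from the answer above) is to aggregate each child table before joining. A hypothetical SQLite sketch with simplified columns:

```python
# Pre-aggregating each child table avoids the m*n*p fan-out of joining
# labor, material, and service directly. Simplified, made-up schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE workorder (wonum INTEGER, billable INTEGER, status TEXT);
CREATE TABLE labor    (refwo INTEGER, totalcost REAL);
CREATE TABLE material (refwo INTEGER, totalcost REAL);
CREATE TABLE service  (refwo INTEGER, totalcost REAL);
INSERT INTO workorder VALUES (1, 1, 'COMP');
INSERT INTO labor    VALUES (1, 10), (1, 20);   -- two labor lines
INSERT INTO material VALUES (1, 5),  (1, 5);    -- two material lines
INSERT INTO service  VALUES (1, 100);           -- one service line
""")

row = con.execute("""
SELECT WO.wonum,
       COALESCE(L.cost, 0) + COALESCE(M.cost, 0) + COALESCE(S.cost, 0) AS total
FROM workorder AS WO
LEFT JOIN (SELECT refwo, SUM(totalcost) AS cost
           FROM labor GROUP BY refwo)    AS L ON L.refwo = WO.wonum
LEFT JOIN (SELECT refwo, SUM(totalcost) AS cost
           FROM material GROUP BY refwo) AS M ON M.refwo = WO.wonum
LEFT JOIN (SELECT refwo, SUM(totalcost) AS cost
           FROM service GROUP BY refwo)  AS S ON S.refwo = WO.wonum
WHERE WO.billable = 1 AND WO.status = 'COMP'
""").fetchone()
print(row)  # one row per work order, correct total: (1, 140.0)
```

The LEFT JOINs also keep work orders that happen to have no labor, material, or service lines at all.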
I have a django database application, which is constantly evolving.
We want to track the progress of samples as they progress from
sample -> library -> machine -> statistics, etc.
Generally it is a one to many relationship from each stage left to right.
Here is a simplified version of my database schema
table sample
id
name
table library
id
name
sample_id (foreign key to sample table)
table machine
id
name
status
library_id (foreign key to library table)
table sample_to_projects
sample_id
project_id
table library_to_subprojects
library_id
subproject_id
So far it has been going OK, except that now everything needs to be viewed by project. Each of the stages can belong to one or more projects, so I have added a many-to-many relation between project and the existing tables.
I am trying to create some views that do the multiple left joins and show the progress of samples for a project.
sample A
sample B library_1 machine_1
sample B library_2 machine_2
sample C library_3
first try at the query was like this:
SELECT fields FROM
    sample_to_projects,
    sample
    LEFT JOIN library ON sample.id = library.sample_id,
    library_to_projects
    LEFT JOIN machine ON machine.library_id = library.id
WHERE
    sample_to_projects.project_id = 30
    AND sample_to_projects.sample_id = sample.id
    AND library_to_projects.project_id = 30
    AND library_to_projects.library_id = library.id
The problem here is that the LEFT JOIN is done before the WHERE clause.
So suppose we have a sample that belongs to project_A and project_B.
If the sample has a library for project_B but we are filtering on project_A, the LEFT JOIN does not produce a row with NULLs for the library columns (since there are libraries); instead those rows get filtered back out by the WHERE clause, and the sample does not show up.
results filtering on project_A
sample_1(project_A, project_B) library_A (project_A)
sample_1(project_A, project_B) library_B (project_A, project_B)
sample_2(project_A, project_B) library_C (project_B) *this row gets filtered out, it should show only the sample details*
So my solution is to create a subquery to join the other (right hand side) tables before the LEFT JOIN is done.
SELECT fields FROM
    sample_to_projects,
    sample
    LEFT JOIN (
        SELECT library.id AS lib_id, library.sample_id AS sample_id,
               library.name AS lib_name, machine.name AS machine_name
        FROM library
        JOIN library_to_projects ON library_to_projects.library_id = library.id
        JOIN machine ON machine.library_id = library.id
        WHERE library_to_projects.project_id = 30
    ) AS join_table ON sample.id = join_table.sample_id
WHERE
    sample_to_projects.project_id = 30
    AND sample_to_projects.sample_id = sample.id
The problem is that there are a few more stages in the real version of my database, so I would need a nested subquery for each LEFT JOIN. The SQL will get pretty large and difficult to read, and I wondered if there is a better solution at the design level? Also it won't play nicely with Django models (though if I can get the SQL working I will be happy enough).
Or can anyone suggest some sort of best practices for this type of problem? I am sure it must be relatively common with showing users in groups or something similar. If anyone knows a way that would fit well with django models that would be even better.
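One common way to keep a sample's NULL row while still filtering the joined side (my suggestion, not from the thread) is to move the per-project condition into the LEFT JOIN itself rather than the WHERE clause. A SQLite sketch with made-up data, using the schema's table names:

```python
# Filtering the LEFT JOIN's right side inside the join condition keeps
# unmatched samples as NULL rows instead of dropping them in WHERE.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample (id INTEGER, name TEXT);
CREATE TABLE library (id INTEGER, name TEXT, sample_id INTEGER);
CREATE TABLE library_to_projects (library_id INTEGER, project_id INTEGER);
INSERT INTO sample VALUES (1, 'sample_1'), (2, 'sample_2');
INSERT INTO library VALUES (100, 'library_A', 1), (101, 'library_C', 2);
INSERT INTO library_to_projects VALUES (100, 30),  -- library_A: project 30
                                       (101, 31);  -- library_C: project 31 only
""")

rows = con.execute("""
SELECT s.name, l.name
FROM sample AS s
LEFT JOIN library AS l
  ON l.sample_id = s.id
 AND l.id IN (SELECT library_id FROM library_to_projects
              WHERE project_id = 30)
ORDER BY s.id
""").fetchall()
print(rows)  # sample_2 keeps its NULL row: [('sample_1', 'library_A'), ('sample_2', None)]
```

The same condition-in-the-join idea maps onto Django via `FilteredRelation`, though I haven't shown that here.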
What about creating separate views for each Project_Id?
If you leave the database structure as is and add to it as the application progresses, you can create a separate view for each stage or Project_Id. If there are 30 stages (Project_Id 1..30), then create 30 separate views.
When you add a new stage, create a new view.
I'm not precisely clear on what you're using this for, but it looks like your use-case could benefit from Pivot Tables. Microsoft Excel and Microsoft Access have these, probably the easiest to set up as well.
Basically, you set up a query that joins all your related data together, possibly with some parameters a user would fill in (would make things faster if you have large amounts of data), then feed the result to the Pivot Table, and then you can group things any way you want. You could, on the fly, see subprojects by library, samples by machine, libraries by samples, and filter on any of those fields as well. So you could quickly make a report of Samples by Machine, and filter it so only samples for machine 1 show up.
The benefit is that you make one query that includes all the data you might want, and then you can focus on just arranging the groups and filtering. There are more heavy-duty systems for this sort of stuff (OLAP servers), but you may not need that if you don't have huge amounts of data.