How do I write a measure in this Power Pivot table that will only sum values next to a unique value? - duplicates

I want to sum 'hours' in this table. Each 'item's' hours should be counted once, even if the item appears twice. So in the example below, Group A has 12.25 hours.
Here is the source table:
A Power Pivot pivot table gives me:
So it's double-counting rows where 'item' occurs twice, of course.
Because the 'hours' for different 'items' aren't the same, I'm not sure how to write a DAX measure that makes this work in the pivot table (this is just an example; the real dataset has the same problem but is much larger). I tried
=([Sum of Hours]/COUNT([Hours]))*DISTINCTCOUNT([Item])
However, it's not the correct calculation: it gave me 9.84375 for Group A (the right answer is 12.25) and 47.53125 for Group B (44 is correct).
You can see the correct numbers from a deduped list (for unrelated reasons, it's not feasible to dedupe the actual list).
What measure (or combo of them) is going to give me what I need?
Thanks!

Sorry for the delay; your question was a welcome distraction after a rough interview. I tried to make it as simple as possible:

CALCULATE (
    SUMX (
        VALUES ( Table1[Item] ),             -- one row per distinct Item in the current filter context
        CALCULATE ( MIN ( Table1[Hours] ) )  -- context transition: the Hours value of that single Item
    )
)
The outer CALCULATE would be necessary only if you want to override the filter context (slicers, row headers, column headers) present in your report.
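For readers more comfortable in SQL, here is a sketch of the same logic: deduplicate the (group, item, hours) rows first, then sum. Table1, Item, and Hours come from the question; GroupName is an assumed name for the grouping column, and it assumes duplicate rows of an item repeat the same hours, as in the example.

-- Hypothetical SQL equivalent of the measure above (not from the original post).
-- GroupName is an assumed column name; duplicates of an Item are assumed
-- to carry identical Hours values, as in the question.
SELECT GroupName, SUM(Hours) AS Total_Hours
FROM (
    SELECT DISTINCT GroupName, Item, Hours
    FROM Table1
) AS deduped
GROUP BY GroupName;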
Which route to take to learn DAX depends on the time you have available. I have always believed that your first approach to this type of language should be practical, focused on scenarios and solutions, so that you do not lose the will to learn (Learn to Write DAX: A Practical Guide to Learning Power Pivot for Excel and Power BI by Matt Allington). Once you are interested in the topic, you can move on to mid-level books (Microsoft Excel 2013: Building Data Models with PowerPivot; Analyzing Data with Microsoft Power BI and Power Pivot for Excel ...) or jump directly to The Definitive Guide to DAX: Business intelligence with Microsoft Power BI, SQL Server Analysis Services, and Excel (Second Edition). Finally, understanding the language is only about 80% of the way; you also need practice and to identify patterns: DAX Patterns (Marco Russo and Alberto Ferrari) and all the SQLBI resources will be very helpful for you.

Related

SQL Database design for statistical analysis of many-to-many relationship

It's my first time working with databases, so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.
I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:
Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.
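For concreteness, a minimal sketch of that design in MySQL might look like the following; any column not named in the question is an illustrative assumption.

-- Sketch of the proposed schema; extra columns are assumptions.
CREATE TABLE Runners (
    Runner_ID INT PRIMARY KEY AUTO_INCREMENT,
    Name      VARCHAR(100) NOT NULL,
    DOB       DATE
);

CREATE TABLE Races (
    Race_ID   INT PRIMARY KEY AUTO_INCREMENT,
    Name      VARCHAR(100) NOT NULL,
    Race_Date DATE
);

-- Bridge table: one row per runner per race.
CREATE TABLE Race_Results (
    Race_ID   INT NOT NULL,
    Runner_ID INT NOT NULL,
    Time      TIME NOT NULL,  -- finishing time
    Age       INT,            -- redundant if DOB and Race_Date are stored (see answer below)
    PRIMARY KEY (Race_ID, Runner_ID),
    FOREIGN KEY (Race_ID)   REFERENCES Races(Race_ID),
    FOREIGN KEY (Runner_ID) REFERENCES Runners(Runner_ID)
);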
The Statistics table is what I'm looking to get to in the end. In the image are just some random things I may want to calculate.
So my questions are:
Does this design make sense? What improvements might you make?
What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?
Should I possibly be using Python or another language to get these statistics instead of SQL? My understanding was that SQL has the potential to cut down a few hundred lines of Python code to one line, so I thought I'd try to give it a shot with SQL.
Thanks!
I think your design is fine, though Race_Results.Age is redundant - watch out if you update a runner's DOB or a race date.
It should be reasonably easy to create views for each of your statistics. For example:
CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;
CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
  ON rr.Race_ID = b.Race_ID
 AND rr.Time <= ADDTIME(b.Time, '00:10:00');  -- ADDTIME works on TIME values; DATE_ADD expects a date
SELECT
rr.Runner_ID,
COUNT(*) AS Number_of_races,
COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID;
1) The design of your 3 tables Races, Race_Results and Runners makes perfect sense. Nothing to improve here. The statistics are something different. If you manage to write those probably slightly complicated queries in a way that they can be used in a view, you should do that and avoid saving statistics that need to be recalculated each day. Calculating something like this on the fly whenever it is needed is better than saving it, as long as the performance is sufficient.
2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL, you will have to use GROUP BY and subqueries, which makes the whole approach a bit more complicated, but it is totally feasible.
If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
3) If you can, put your code in the database. That way, you avoid frequent context switches between your programming language and the database. This approach is usually the fastest in any database system.
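As a hedged illustration of point 3, the statistics logic could live in a stored procedure inside MySQL; the procedure name and the exact columns are assumptions, not something from the original post.

-- Sketch: keep the computation in the database instead of application code.
DELIMITER //
CREATE PROCEDURE runner_statistics(IN p_runner_id INT)
BEGIN
    SELECT rr.Runner_ID,
           COUNT(*)                               AS races_run,
           MIN(rr.Time)                           AS best_time,
           SEC_TO_TIME(AVG(TIME_TO_SEC(rr.Time))) AS avg_time
    FROM Race_Results rr
    WHERE rr.Runner_ID = p_runner_id
    GROUP BY rr.Runner_ID;
END //
DELIMITER ;

-- Usage: CALL runner_statistics(42);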

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple, but I'm stuck because of the massive number of columns or tables, depending on how it's set up.
We have several tools, each of which can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle: each tool can be run multiple times, and many tools share the same data points. My next project is to build an application that can begin data entry for a tool but allow importing data that shares the same data point from a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate it) when they complete tool 2, since it already exists.
I think what it boils down to is: would it be better to create a structure with 1500 tables, one for each data element (with additional data around each element), or to create a single massive table, something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table, I'm not sure how that will impact performance. Likewise, I'm not sure how difficult it would be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers, title, description, active (bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, ... other stuff. Give each row a unique id.
Build a table for each unique tool, with the company id from above and any data unique to that implementation. Give each table a primary (unique) key covering 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
More about normalization here.
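A minimal sketch of that layout in MySQL (table and column names are illustrative, not from the post):

-- Common data shared by every tool, one row per customer.
CREATE TABLE Customers (
    Customer_ID     INT PRIMARY KEY AUTO_INCREMENT,
    Company         VARCHAR(100) NOT NULL,
    Number_Of_Users INT
);

-- One table per tool; each row is one run of that tool for a customer.
CREATE TABLE Tool1_Runs (
    Run_ID              INT PRIMARY KEY AUTO_INCREMENT,  -- identifies each 'tool use'
    Customer_ID         INT NOT NULL,                    -- ties the run to a company
    Number_Of_Locations INT,
    Cost                DECIMAL(12,2),
    FOREIGN KEY (Customer_ID) REFERENCES Customers(Customer_ID)
);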
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This essentially means pre-building the joined reporting data into new tables, with heavy use of indexing, archiving, and partitioning.
In many cases, smart use of partitioning on its own can also effectively reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed, though.
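As a sketch of that idea, range partitioning by date in MySQL could look like this (the table and columns are made up for illustration):

-- Queries that filter on Run_Date only touch the matching partitions.
CREATE TABLE Tool_Results (
    Result_ID BIGINT NOT NULL,
    Run_Date  DATE NOT NULL,
    Value     DECIMAL(12,2),
    PRIMARY KEY (Result_ID, Run_Date)  -- MySQL requires every unique key to include the partitioning column
)
PARTITION BY RANGE (YEAR(Run_Date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);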
Ultimately, in your case (and most others), I highly recommend building it in a way you are able to maintain and understand, and then performing regular performance checks via slow query logs, EXPLAIN, and performance-monitoring tools like Percona's toolkit. This will give you insight into what is really happening and give you some data to come back here or to the MySQL forums with. We can always speculate, but ultimately the real data and your setup will be the driving force behind what is right for you.

SSAS calculated measure: Access relational database

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections, which got answered pretty well. Now there is another nice-to-have requirement for our cube: extending that to more data. The general question remains: how many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement this using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware: the cube is reaching sizes close to 0.5 TB, and queries take several minutes to complete.
Now I would like to try another option: can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here are my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the CurrentMember of each hierarchy; however, this does not work with Excel-style subselects. It either contains a single member or the AllMember.
Another option is to get the MDX query using the DMV SELECT * FROM $System.DISCOVER_SESSIONS. There is a column on that view which contains the last MDX query for a given session. However, in order not to overwrite your own last query, you must not use the current connection but open a new one. The session id can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
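For reference, that DMV lookup might look like the sketch below; SESSION_LAST_COMMAND is the column holding the last statement in the standard DISCOVER_SESSIONS rowset, but verify it against your SSAS version, and the session id literal is a placeholder.

-- Run this on a NEW connection so it does not overwrite the query
-- you are trying to capture on the original session.
SELECT SESSION_ID, SESSION_LAST_COMMAND
FROM $System.DISCOVER_SESSIONS
WHERE SESSION_ID = '<id from Context.CurrentConnection.SessionID>'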
The second approach is OK for our use case. It does not let you handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!

Automate creation and deployment of SSRS report from single table query

What is the most efficient way to automate both creation and deployment of simple SSRS reports from one underlying query?
An example query might look like
SELECT Name, ID, Date FROM Errorlog
The query could return quite a few columns and anywhere from 1 to 1 million rows.
The business purpose behind this question is that I have a sizable number of report queries that need to go out as SSRS reports. I also need the capacity to turn any query I write instantly (or within a matter of seconds) into a simple SSRS report. Unfortunately, doing it through BIDS manually (using toolbox items and creating datasets) is cumbersome, slow, and unnecessarily repetitive. The only things I am concerned with are making sure the interactive page height/width is zero (to allow scrolling) and that columns are autosized.
How would you accomplish this in a way that is smooth and repeatable?
Let me start by saying that I don't think SSRS will be very good at this. Specifically, it may be troublesome on two points.
First, the number of rows may become a problem. One million results is typically a bit much for Reporting Services 2008 (though it does depend on the context a bit); it's much better at displaying either aggregated data or a limited number of data rows (up to a few thousand, though again depending on context).
Second, a dynamic number of columns being returned by the SQL side will be a problem. There are only two ways around this that I know of:
Have a denormalized data set with a fixed number of columns, and one or more columns that contain the grouping. Then use a matrix to generate columns dynamically in SSRS. This does have a considerable performance impact.
Generate the RDL dynamically. There's information on the schema to do this, and if you create a good starting point it's very possible. After generating the RDL you'll have to execute it; how to do that depends on your specific setup.
Bottom line: I wouldn't recommend using SSRS for the task you describe. Consider other technologies that may be better suited to this task, e.g. SSIS packages, or perhaps another custom-made or third-party tool.
If I were you, I'd use 'Access Data Projects', which have a wizard for creating reports that are then easy to upsize to Reporting Services. Right-click IMPORT into a solution full of RDL, and it prompts for an MS Access file.
You can easily turn a couple of columns into a report using an Access wizard and then upsize it to SSRS. I've done it hundreds upon hundreds of times.

Top k problem - finding usage for my academic work

Top k problem: searching for the BEST k (3 or 1000) elements in a DB.
There is a fundamental problem with relational DBs: to find the top k elements, ALL rows in the table have to be processed, which makes them useless on big data.
I'm making an application (for university research; not really my invention, I'm implementing and trying to improve the original idea) that allows you to effectively find the top k elements by visiting only 3-5% of the stored data, which makes it really fast.
It even supports user preferences: for a given domain, you can specify a value function that defines the best value for the user and an aggregation function that defines the most significant attributes.
For example, take a DB of cars with attributes (price, mileage, age of car, ccm, fuel/mile, type of car, ...), where the user values, say, 10*price + 5*fuel/mile + 4*mileage + age of car and doesn't care about the type of car or the rest. This is the aggregation specification.
Then for each attribute (price, mileage, ...) there can be a totally different "value function" that specifies the best value for the user. For example, price: the lower the better, with the value going down to 0 at $50k (the user doesn't want a car more expensive than $50k); mileage: another function based on his/her criteria, and so on.
You can see that there is quite a lot of freedom to specify your preferences, and according to them, the best k elements in the DB will be found quickly.
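For contrast, the naive relational baseline the question argues against is a full scan plus sort; in SQL it might look like this (schema invented for the car example; lower score = better):

-- Full-scan baseline: score every row, sort, keep the top k.
SELECT Car_ID,
       10 * Price + 5 * Fuel_Per_Mile + 4 * Mileage + Age AS Score
FROM Cars
ORDER BY Score
LIMIT 3;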
I've spent many sleepless nights thinking about real-life usability: who could benefit from such a query DB? But I failed to come up with anything and am stuck at a purely academic, write-only stance. :-( I hope there is some real usage for it, but I don't see any....
... do YOU have any idea how to use this in real life, on a real problem, etc.?
I'd love to hear from you.
Have a database of people's CVs and establish hiring criteria for different jobs, allowing for a dynamic display of the top k candidates.
Also, considering the speed of your solution, you could think of exploiting it to render near real-time graphs of highly dynamic data, like stock market quotes, or even in applications in molecular or DNA-related studies.
New idea: perhaps your research might have applications in clustering, where you could use it to implement fast k-Nearest-Neighbor clustering by complex criteria without having to scan the whole data set each time. This would lead to faster clustering of larger data sets with respect to more complex criteria when picking the k-NN for each data node.
There are unlimited possible real-use scenarios. Getting the top-n values is used all the time.
But I highly doubt that it's possible to get the top-n objects without having an index. An index can only be built if the properties that will be searched are known before searching. And if that's the case, a simple index in a relational database can provide the same functionality.
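For a scoring function that is fixed and known in advance, such an index could even be a generated column (MySQL 5.7+ syntax; names invented for the car example above):

-- Precompute the score and index it, so "top k" becomes an index scan
-- instead of a full table scan.
ALTER TABLE Cars
    ADD COLUMN Score DOUBLE AS (10 * Price + 5 * Fuel_Per_Mile + 4 * Mileage + Age) STORED,
    ADD INDEX idx_cars_score (Score);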
It's used in financial organizations all the time: you need to see the most profitable / least profitable assets, etc.