SSAS calculated measure: Access relational database - many-to-many

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections that got answered pretty fine. Now, there is another nice-to-have requirement for our cube to extend that to more data. The general question remains: How many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement that using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware, the cube is reaching sizes next to 0.5 TB, and querys take several minutes to complete.
Now I would try to use another option: Can I access our relational database in a calculated measure? It seems I can, using UDFs like described in this article. I could write a Function in c# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example microsoft provides only includes a small deterministic string-function as the UDF.

Here my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the currentMember of each Hierarchy, however this does not work with Excel-Style-Subselects. It either contains a single member or the AllMember.
Another option is to get the MDX query using the dmv SELECT * FROM $System.DISCOVER_SESSIONS. There will be a column on that view which contains the last mdx query for a given session. However in order to not overwrite your own last query, you will need to not use the current connection, but to open a new one. The session id can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
The second approach is ok for our use-case. It does not allow you to handle axes, since the udf-function has a cell-scope, but you don't know which cell you are in. If anyone of you knows anything about that last bit, please tell me. Thanks!

Related

Partial Data Set in WEBI 4.0

When I run a query in Web Intelligence, I only get a part of the data.
But I want to get all the data.
The resulting data set I am retrieving from database is quite large (10 million rows). However, I do not want to have 10 million rows in my reports, but to summarize it, so that the report has the most 50 rows.
Why am I getting only a partial data set as a result of WEBI query?
(I also noticed that in the bottom right corner there is an exclamation mark, that indicates I am working with partial data set, and when I click on refresh I still get the partial data set.)
BTW, I know I can see the SQL query when I built it using query editor, but can i see the corresponding query when I make a certain report? If yes, how?
UPDATE: I have tried the option by editing the 'Limit size of result set to:' in the Query Options in Business Layer by setting the value to 9 999 999 and the again by unchecking this option. However, I am still getting the partial result.
UPDATE: I have checked the number of rows in the resulting set - it is 9,6 million. Now it's even more confusing why I'm not getting all the rows (the max number of rows was set to 9 999 999)
SELECT
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Description_TXT,
count(I_ATA_MV_FinanceTreasury.VWD_Party_A.Party_KEY)
FROM
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A RIGHT OUTER JOIN
I_ATA_MV_FinanceTreasury.VWD_Party_A ON
(I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Segment_Value_KEY=I_ATA_MV_FinanceTreasury.VWD_Party_A.Segment_Value_KEY)
GROUP BY 1
The "Limit size of result set" setting is a little misleading. You can choose an amount lower than the associated setting in the universe, but not higher. That is, if the universe is set to a limit of 5,000, you can set your report to a limit lower than 5,000, but you can't increase it.
Does your query include any measures? If not, and your query is set to retrieve duplicate rows, you will get an un-aggregated result.
If you're comfortable reading SQL, take a look at the report's generated SQL, and that might give you a clue as to what's going on. It's possible that there is a measure in the query that does not have an aggregate function (as it should).
While this may be a little off-topic, I personally would advise against loading that much data into a Web Intelligence document, especially if you're going to aggregate it to 50 rows in your report.
These are not the kind of data volumes WebI was designed to handle (regardless whether it will or not). Ideally, you should push down as much of the aggregation as possible to your database (which is much better equipped to handle such volumes) and return only the data you really need.
Have a look at this link, which contains some best practices. For example, slide 13 specifies that:
50.000 rows per document is a reasonable number
What you need to do is to add a measure to your query and make sure that this measure uses an aggregate database function (e.g. SUM()). This will cause WebI to create a SQL statement with GROUP BY.
Another alternative is to disable the option Retrieve duplicate rows. You can set this option by opening the data provider's properties.
.

Ranking search results

I have a tutoring website with a search feature. I want tutors to appear on the list according to several weighted criteria, including whether or not they are subscription holders, if they have submitted a profile photo, if they have included a lot of information about themselves, etc...
Basically, I have a lot of criteria by which I would like to weigh their rank.
Instead of writing a complicated SQL query with multiple ORDER BYs (if this is even possible), I was thinking of creating a table (maybe a temporary one), that assigns numerical values based on several criteria to come up with a final search rank.
I'm not entirely sure about how to go about this, or if this is a good idea, so I would like to know what the community thinks about a) this method, and b) possible ways of implementing this in SQL.
I would add a field to one of the existing tables that more or less was a representation of their "weight" for sorting purposes. I would then populate this column with a database procedure that ran every so often (you could make a queue that only runs on records that have been updated, or just run it on all records if you want). That way, I can just pull back the data and order by one column instead of multiple ones.
Also, you could use a View. It really depends on if you want to number crunching to be done by the procedure or by the database every time you pull data (for a search feature and for speed's sake, I'd suggest the database procedure).

Most efficient way to select lots of values from database

I have a database table with around 2400 'items'. A user will have any arbitrary combination of items from the 2400 item set. Each item's details then need to be looked up and displayed to the user. What is the most efficient way to get all of the item details? I can think of three methods:
Select all 2400 items and parse the required details client-side.
Select the specific items with a SELECT which could be a very long SQL string (0-2400 ids)?
Select each item one at a time (too many connections)?
I'm not clued up on SQL efficiency so any guidance would help. It may help to know this is a web app heavily AJAX based.
Edit: On average a user will select ~150 items and very rarely more than 400-500.
The best method is to return the data you want from the database in a single query:
select i.*
from items i
where i.itemID in (<list of ids>);
MySQL queries can be quite large (see here), so I wouldn't worry about being able to pass in the values.
However, if your users have so many items, I would suggest storing them in the database first and then doing a join to get the additional information.
If the users never/rarely select more than ~50 elements, then I agree with Gordons answer.
If it is really plausible that they might select up to 2400 items, you'll probably be better off by inserting the selected ids into a holding table and then joining with that.
However, a more thorough answer can be found here - which I found through this answer.
He concludes that:
We see that for a large list of parameters, passing them in a temporary table is much faster that as a constant list, while for small lists performance is almost the same.
'Small' and 'large' are hardly static, but dependent upon your hardware - so you should test. My guess would be that with an average of 150 elements in your IN-list, you will see the temp table win.
(If you do test, please come back here and say what is the fastest in your setup.)
2400 items is nothing. I've have a MySql database which has hundreds of thousands of rows and relation and are working perfectly, by just optimizing the queries.
What you must do is, see how long the execution time is for each sql query. You can then optimize each query on its own, trying different querys and measure the execution time.
You can use ex. MySql Workbench, Sequel pro, Microsoft Server management studio or another software for building queries. Also you can add indexes to your tables, which can improve queries as well.
If you need to scale your database up you can use software like http://hadoop.apache.org
Also a great thing to mention is NoSQL (Not-Only SQL). It's a relation-less database, which can handle dynamic attributes and are build for handling large amount of data.
As you mention you could use AJAX. But that only helps the load time of the page and the stress of your web server (Not SQL server). Just ask if you wan't more info or more in-depth explanation.

Would using Redis with Rails provide any performance benefit for this specific kind of queries

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to category (we call it Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What user enters isn't always the same what he sees in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
Problem is that those stats needs also to be summed for a subset of users under selected user (so it will basically return sum of all values for all users under the user, including self).
This app is in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me list of all users under myself and their statistics. One line for summed by their sub-group and one line for their personal stats"). Of course, users wants (and needs) their reports to be as actual as possible, 5 mins to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot do the high-intensive sqls directly... That would kill the server. So I'm computing them only once via background process and frontend just reads the results.
Those sqls are hard to optimize and I'm glad I've moved from this approach... (caching is not an option. See below.)
Current app goes like this:
frontend: when user enters new data, it is saved to simple mysql table, like [user_id, pointer_id, date, value] and there is also insert to the queue.
backend: then there is calc_daemon process, which every 5 seconds checks the queue for new "recompute requests". We pop the requests, determine what else needs to be recomputed along with it (pointers have dependencies... simplest case is: when you change week stats, we must recompute month and year stats...). It does this recomputation the easy way.. we select the data by customized per-pointer-different sqls generated by their classes.
those computed results are then written back to mysql, but to partitioned tables (one table per year). One line in this table is like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced 5x # of records).
when frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested set conds).
The problem is that those simple sqls with sums, group by and join-on-the-subtree can take like 200ms each... just for a few records.. and we need to run a lot of these sqls... I think they are optimized the best they can, according to explain... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or other fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I'll rewrite it to use redis, I'll have to run much more queries against it than I have to with mysql, and then perform the sum in ruby manually... so the performance can be hurt considerably... I'm not really sure if I could write all the possible queries I have now with redis... Loading the users in rails and then doing something like "redis, give me sum for users 1,2,3,4,5..." doesn't seem like right idea... But maybe there is some feature in redis that could make this simpler?)...
Also the tree structure needs to be like nested set, i.e. it cannot have one entry in redis with list of all child-ids for some user (something like children_for_user_10: [1,2,3]) because the tree structure changes frequently... That's also the reason why I can't have those sums in those partitioned tables, because when the tree changes, I would have to recompute everything.. That's why I perform those sums realtime.)
Or would you suggest me to rewrite this app to different language (java?) and to compute the results in memory instead? :) (I've tried to do it SOA-way but it failed on that I end up one way or another with XXX megabytes of data in ruby... especially when generating the reports... and gc just kills it...) (and a side effect is that one generating report blocks the whole rails app :/ )
Suggestions are welcome.
Redis would be faster, it is an in-memory database, but can you fit all of that data in memory? Iterating over redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events), for example it has a fast INCR command.
I'm guessing that you would get sufficient speed improvement by using a stored procedure or a faster language than ruby (eg C-inline or Go) to do the recalculation. Are you doing group-by in the recalculation? Is it possible to change group-bys to code that orders the result-set and then manually checks when the 'group' changes. For example if you are looping by user and grouping by week inside the loop, change that to ordering by user and week and keep variables for the current and previous values of user and week, as well as variables for the sums.
This is assuming the bottleneck is the recalculation, you don't really mention which part is too slow.

Best approach to construct complex MySQL joins and groups?

I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.
I don't work in mySQL but I do frequently write extremely complex SQL and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attened and the speaker information.
First I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how as yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order although I prefer to work through the parts that should have one-one relationshisps first, so next I'll add the joins and the fields that will get me all the sales rep associated information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one) so I check to make sure that I'm still returning the same number of records as when I just had meeting information. If not I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table as we are storing multiple address for the rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want and so only need to add a where condition) or you may need to do some grouping and aggregate functions to get what you want.
Then I go on to the next chunk (working first through all the chunks that should have a 1-1 relationshisp to the central data in this case the meeting). Runthe query nd check the data after each addition.
Finally I move to those records which might have a one-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose in one of these additions of a join I find the number of distinct meetings has dropped. Oops, then there is no data in one of the tables I just added and I need to change that to a left join.
Another time I may find too many records returned. Then I look to see if my where clause needs to have more filtering info or if I need to use an aggreagte function to get the data I need. Sometimes I will add other fields to the report temporarily to see if I can see what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model and check the data after every new chunk is added to make sure it is returning the results the way you think they should be.
Sometimes, If I'm returning a lot of data, I will temporarily put an additonal where clause on the query to restrict to a few items I can easily check. I also strongly suggest the use of order by because it will help you see if you are getting duplicated records.
Well the best approach to break down your MySQL query is to run the EXPLAIN command as well as looking at the MySQL documentation for Optimization with the EXPLAIN command.
MySQL provides some great free GUI tools as well, the MySQL Query Browser is what you need to use.
When running the EXPLAIN command this will break down how MySQL interprets your query and displays the complexity. It might take some time to decode the output but thats another question in itself.
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More
I haven't used them myself so can't comment on their effectiveness, but perhaps a GUI based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).