spotfire: How to add a datapoint to the scatterplot which represents the average of displayed points

In TIBCO Spotfire, is there a way to easily add a point to the currently displayed scatter plot which represents the average of the other points on the plot?
NOTE: in my case there will be filtering, i.e. I have 3 columns of data:
myCategory X Y
cat1 1 1
cat1 2 2
cat2 10 10
cat2 20 20
So essentially, when I filter and select cat1, I would like 3 points displayed: (1,1), (2,2), and the average (1.5,1.5).
Similarly, with cat2 selected, I would like the extra point (15,15) displayed.
Is there a way to accomplish this?
NOTE: I think the OVER() function might be useful to calculate an average. It might also be possible to do it by adding a calculated column holding the average, but a solution without an additional column would be better since the dataset is huge.

I have been working toward a solution for this, and because of limitations on what can be done on the x-axis, I don't see a way to make this happen without a calculated column. I do, however, see a way to display the data you're looking for differently: average lines can be added vertically and horizontally, with labels that show the value (or not). The point, visually speaking, resides at the intersection of the two average lines. Feel free not to mark this as accepted, as it does not provide the extra point you are looking for.
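If precomputing the data outside Spotfire is acceptable, here is a minimal sketch in Python/pandas (not Spotfire's own expression language; the column names are the ones from the question, and the isAverage helper column is a hypothetical addition for styling) of how the extra average rows could be generated before loading:

# A rough sketch: append one per-category average point to the data before
# loading it into Spotfire, using the example data from the question.
import pandas as pd

df = pd.DataFrame({
    "myCategory": ["cat1", "cat1", "cat2", "cat2"],
    "X": [1, 2, 10, 20],
    "Y": [1, 2, 10, 20],
})

# compute the average X and Y per category
averages = df.groupby("myCategory", as_index=False)[["X", "Y"]].mean()
averages["isAverage"] = True   # hypothetical flag to color/shape the extra point
df["isAverage"] = False

combined = pd.concat([df, averages], ignore_index=True)
print(combined)

Because each average row keeps its myCategory value, filtering to cat1 would show (1,1), (2,2), and (1.5,1.5), exactly as requested, at the cost of extra rows and a helper column.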

Related

SQL - How to find optimal performance numbers for query

First time here, so forgive me for any faux pas. I have a question about the limitations of SQL, as I am new to the language, and what I need is, I believe, rather complex.
Is it possible to automate finding the optimal data for a specific query? For example, say I have the following columns:
1) Vehicle type (Text) e.g. car, bike, bus
2) Number of passengers (Numeric) e.g. 0-7
3) Was in an accident (Boolean) e.g. t or f
From here, I would like to get percentages. So if I were to select only cars with 3 passengers, what percentage of the total accidents does that account for?
I understand how to get this as a one-off or calculate it mathematically; however, my question is about how to automate this process to find the optimum number.
So, keeping with this example, say I look at just cars: what number of passengers covers the highest percentage of accidents?
At the moment I am going through and testing number by number. Is there a way to 'find' the optimal number? It is easy when it is just 0-7 as in the example, but I would naturally like to deal with a larger range, and even multiple ranges. For example, say we add another variable titled:
4) Number of doors (Numeric) e.g. 0-3
Would there be a way of finding the best combination of numbers from these two variables that cover the highest percentage of accidents?
So say we took: Car, >2 passengers, <3 doors. Out of the accidents variable, 50% were true.
But if we change that to: Car, >4 passengers, <3 doors, out of the accidents variable, 80% were true.
I hope I have explained this well. I understand that this is most likely not possible with SQL alone, but is there another way to find these optimum numbers?
Thanks in advance
Here's a query that will give you an answer for all possibilities. You could add a LIMIT clause to show only the top answer, or add to the WHERE clause to restrict it to specific terms.
SELECT
  `vehicle_type`,
  `num_passengers`,
  SUM(IF(`in_accident`, 1, 0)) AS `num_accidents`,
  COUNT(*) AS `num_in_group`,
  SUM(IF(`in_accident`, 1, 0)) / COUNT(*) AS `percent_accidents`
FROM `accidents`
GROUP BY
  `vehicle_type`,
  `num_passengers`
ORDER BY `percent_accidents` DESC;  -- DESC so the top answer comes first with LIMIT 1
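The combined search over two numeric variables that the question asks about is awkward in a single query; here is a hedged sketch in Python of a brute-force threshold search, assuming the rows have already been fetched into (vehicle_type, num_passengers, num_doors, in_accident) tuples (the sample rows are made up for illustration):

# A rough sketch: try every (min passengers, max doors) threshold pair for cars
# and report the pair with the highest accident percentage.
rows = [
    # (vehicle_type, num_passengers, num_doors, in_accident)
    ("car", 5, 2, True),
    ("car", 1, 4, False),
    ("car", 6, 2, True),
    ("bike", 1, 0, False),
]

best = None
for min_pass in range(0, 8):         # condition: more than min_pass passengers
    for max_doors in range(1, 5):    # condition: fewer than max_doors doors
        subset = [r for r in rows
                  if r[0] == "car" and r[1] > min_pass and r[2] < max_doors]
        if not subset:
            continue
        pct = sum(r[3] for r in subset) / len(subset)
        if best is None or pct > best[0]:
            best = (pct, min_pass, max_doors, len(subset))

pct, min_pass, max_doors, n = best
print(f">{min_pass} passengers, <{max_doors} doors: "
      f"{pct:.0%} of {n} matching rows were accidents")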

Metrics-Database Number Saving Conventions?

I want to save metrics (statistics) every day for some processes on my website so I can display them later in graphs. Examples of metrics might be:
FacebookLikes
SiteVisitors
Now I want to know how to design the MySQL table: should I save the delta values ("DeltaFacebookLikes" and "DeltaSiteVisitors"), or should I save the absolute numbers, which keep growing with each entry?
ID DATE FACEBOOK_LIKES SITE_VISITORS
The first example (saving the delta values) would be:
1 23.10 33 50
2 24.10 14 80
3 25.10 12 5
4 26.10 28 105
The problem here is that I would never have the "total" values unless I sum them up.
The second example (saving the absolute values) would be:
1 23.10 33 50
2 24.10 47 130
3 25.10 59 135
4 26.10 87 240
The problem here is that I have the total values, but I would need to subtract v(x) from v(x+1) to obtain the actual delta value.
What would be the right or wrong way? Is there any wrong way at all? Or does a combination of both make sense?
I don't think there is a right answer to this question, but my intuition says to store the incremental numbers on each day.
Storing the cumulative numbers can be quite efficient. You can readily get the difference between two days just by looking up two values in the table. This is particularly efficient if you have users asking about the number of Facebook likes over arbitrary time frames.
On the other hand, the individual numbers have certain other advantages:
If you make a mistake or miss a day's load, then fixing the problem is easier.
You can more readily calculate standard deviation and variance.
You can more readily calculate averages by non-contiguous time frames -- such as the Monday average.
It is easier to do trend analysis.
So, from an analytic perspective, I would find independent daily measurements to be a better choice. Of course, you can readily translate from one to the other, so the best choice may depend on user access patterns.
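To illustrate that last point, a minimal sketch in Python/pandas of translating between the two representations, using the numbers from the question's examples:

# A rough sketch: converting between daily deltas and cumulative totals.
import pandas as pd

deltas = pd.Series([33, 14, 12, 28], name="facebook_likes_delta")

# delta -> cumulative: running sum
cumulative = deltas.cumsum()                               # 33, 47, 59, 87

# cumulative -> delta: difference with the previous row
recovered = cumulative.diff().fillna(cumulative.iloc[0])   # 33, 14, 12, 28

print(cumulative.tolist(), recovered.tolist())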

Birthday Paradox: How to programmatically estimate the probability of 3, and N, people sharing a birthday

There are extensive resources on the internet discussing the famous Birthday Paradox. It is clear to me how you calculate the probability of two people sharing a birthday, i.e. P(same) = 1 - P(different). However, if I ask myself something apparently simpler, I stall. Firstly, let's say I generate two random birthdays. Getting the same birthday is like tossing a coin: either the two people share a birthday (heads) or they don't (tails). Run this 500 times and the end result (#Heads/500) will somehow be close to 0.5.
Q1) But how do I think about this if I generate three random birthdays? How can I estimate the probability then? Obviously my coin analogy won't be applicable.
Q2) Once I have figured out the above, I will need to scale it up and generate 30 or 50 birthdays. Is there a recommended technique or algorithm to isolate identical birthdays from a large set? Should I put them into arrays and loop through them?
Here's what I think I need:
Q1)
r = 25, i.e. each trial run generates 25 birthdays
Trial 1 > 3 duplicates: 0
Trial 2 > 3 duplicates: 0
Trial 3 > 3 duplicates: 2
Trial 4 > 3 duplicates: 1
...
Trial 100 > 3 duplicates: 2
estimated probability of 3 persons sharing a birthday in a room of 25 = (0+0+2+1+...+2)/100
Q2)
Create an array for 2 duplicates, an array for 3 duplicates, and one for more than 3 duplicates.
Add each generated birthday one by one into the first array, but before doing so, loop through the array to see if it's already in there. If so, add it to the second array, but before doing so repeat the above process, and so on.
It doesn't seem to be a very efficient algorithm, though :) Any suggestions to improve the Big O here?
Create an integer array of length 365, initialized to 0. Then generate N (in your case 25) random numbers between 1-365 and increment that position in the array (i.e. bdays[random_value]++). Since you are only interested in a collision happening, check right after incrementing whether the value is greater than 2: if it is, there is a second collision, which means there are 3 people with the same birthday. Keep track of collisions and execute this as many times as you wish (e.g. 1000).
In the end, the ratio collisions/1000 will be your requested value.
And no, the coin-tossing analogy is wrong.
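A minimal sketch of that simulation in Python, using the room size and trial count from the question and answer:

# A rough sketch of the counting-array simulation described above: estimate
# the probability that at least 3 of 25 people share a birthday.
import random

def trial(room_size=25):
    counts = [0] * 365
    for _ in range(room_size):
        day = random.randrange(365)   # 0-364, one slot per possible birthday
        counts[day] += 1
        if counts[day] >= 3:          # a third person landed on the same day
            return True
    return False

trials = 1000
hits = sum(trial() for _ in range(trials))
print(f"estimated P(3+ share a birthday among 25) = {hits / trials:.3f}")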
Check this similar question and its answers on CrossValidated, but I think it is really worth thinking about the classic Birthday problem again to get the basics.
To the second part of your question: it depends on the language you use. I definitely suggest using R to solve a problem like that, as checking for identical birthdays in a list/vector/data frame can easily be done with a simple unique call. To run such a simple Monte Carlo simulation, R is again really handy; check the second answer at the link above.
Sounds like your first task will be to create a method that will generate random birthdays. To keep things simple, you can use the numbers 1-365 to denote unique birthdays.
Store however many random birthdays (2 in the first case, more later) in an ArrayList as Strings. You will want to use a loop to call the random number function and store each value in your list.
Then make a function to search the ArrayList for duplicates. If there are any duplicates (no matter how many), that's a heads result. If there are no matches, it's tails.
Your probabilities will be far different from 50/50 until you get to 20 or so.
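A hedged sketch of that heads/tails check in Python rather than Java (a set makes the "any duplicates?" test a one-liner):

# A rough sketch: one trial is "heads" if any two generated birthdays collide.
import random

def any_duplicate(n_people):
    days = [random.randint(1, 365) for _ in range(n_people)]
    return len(set(days)) < len(days)   # a duplicate shrinks the set

trials = 10000
heads = sum(any_duplicate(23) for _ in range(trials))
print(f"P(shared birthday, 23 people) ~= {heads / trials:.3f}")  # about 0.507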

How to reduce the impact of attack frequency in a game?

The goal is to make the frequency not so dominating.
Suppose A has an attack frequency of 100 and B's is 2, but I don't want to see such a big difference. I want to reduce it so that A is at most 5 times faster than B, not 100/2 = 50 times faster, while still making sure A remains faster than B. What mechanism would achieve this?
Use the logarithm function to reduce the scale. For example, in log base 2, A's score is between 6 and 7, while B's is 1. Multiply by a constant afterwards if you wish to scale the values up again. You can change the base of the logarithm to adjust how much you want to even out the differences.
Update: You will probably also want to add 1 to the score before taking the logarithm to ensure that scores below 1 don't get converted to large negative numbers.
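A minimal sketch of this idea in Python (the +1 and the base come from the answer; the multiplier is an arbitrary choice):

# A rough sketch: compress raw attack frequencies with a logarithm so the
# fastest attacker is only a few times faster than the slowest.
import math

def scaled(freq, base=2, multiplier=1.0):
    return multiplier * math.log(freq + 1, base)  # +1 avoids negative scores

a, b = scaled(100), scaled(2)
print(a, b, a / b)   # ~6.66, ~1.58: A is now ~4.2x faster than B, not 50x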
You might consider using a Gaussian centered around 100 for A and 2 for B; dig into non-uniform random number generators.
Or you can define another attribute for your game and use the frequency as a factor!
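If the Gaussian route is of interest, a hedged sketch (the spreads are made-up values; this adds variation rather than compressing the 50x ratio):

# A rough sketch: draw each attack's frequency from a Gaussian around the
# attacker's base value, so it varies from attack to attack.
import random

def attack_frequency(mean, spread):
    return max(0.1, random.gauss(mean, spread))  # clamp to keep it positive

print(attack_frequency(100, 10))   # A: usually around 90-110
print(attack_frequency(2, 0.5))    # B: usually around 1.5-2.5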

Where to split Directory Groupings? A-F | G-K | L-P

I'm looking to build a "quick link" directory access widget.
e.g. (option 1)
0-9 | A-F | G-K | L-P | Q-U | V-Z
Where each would be a link into sub-chunks of a directory starting with that character. The widget itself would be used in multiple places for looking up contacts, companies, projects, etc.
Now, for the programming part... I want to know if I should split as above...
0-9 (10+ characters) | A-F (6) | G-K (5) | L-P (5) | Q-U (5) | V-Z (5)
This split is fairly even and logically grouped, but what I'm interested to know is whether there is a better split based on the quantity of typical results starting with each letter. (option 2)
e.g. very few items will start with "Q".
(Note: this is currently for a "North American/English" deployment.)
Does anyone have any stats that would backup reasons to split differently?
Likewise, for usability, how do users like/dislike this type of thing? I know that mentally, if I am looking for, say, "S", it takes me a second to recall that it falls in the Q-U section.
Would it be better to do a big list like this? (option 3)
#|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
I would suggest one link per letter and hiding the letters that don't have any results (if that doesn't ask for too much processing power).
As a user I would most definitely prefer one link per letter.
But better (for me as a user) would be a search box.
I think you're splitting the wrong things. You shouldn't evenly split the letters; you should evenly split the results (as best you can).
If you want 20 results per page, and A has 28 while B-C together have 15, you'll want to have
A
B-C
and so on.
Additionally, you might have to consider why you are using alphabet chunking instead of something a bit more contextual. The problem with alphabet chunking is that users have to know the name of what they are looking for, and that name has to be the same as yours.
EDIT: We've tested this in lab conditions, and users locate information in chunk-by-results and chunk-by-letter-count layouts in pretty much the same way.
EDIT 2: Chunking by letters almost always tests poorly. Think about whether there are any better ways to do this.
Well, one of the primary usability considerations is evenly-distributed groups, so either your current idea (0-9, A-F, etc.) would work well, or the list with each individual letter. Having inconsistently-sized groups is a definite no-no for a user interface.
You almost certainly don't want a group to span the digit-letter boundary; that is, avoid something like
0-4 | 5-B | ...
Besides that, I'd say just see where your data lies. Write a program to do groupings of two, three, four, five, etc., and see what the most even split for each grouping is (see the sketch below). Pick the one that seems nicest. If you have sparse data, then having one link per letter might be annoying when only 1 or 2 directories start with a given letter.
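A hedged sketch of such a program in Python (the letter counts are made-up placeholder data and digits are ignored for simplicity; a real run would use counts from the actual directory):

# A rough sketch: brute-force every way to split the alphabet into k contiguous
# groups and keep the split whose largest group is smallest (the most even one).
from itertools import combinations
import string

letters = string.ascii_uppercase
counts = {letter: 10 for letter in letters}                  # placeholder data
counts.update({"Q": 1, "X": 1, "Z": 2, "S": 30, "C": 25})    # skew a few letters
totals = [counts[c] for c in letters]

def best_split(k):
    best, best_score = None, float("inf")
    for cuts in combinations(range(1, 26), k - 1):   # k-1 cut points
        bounds = (0,) + cuts + (26,)
        groups = [(bounds[i], bounds[i + 1]) for i in range(k)]
        score = max(sum(totals[a:b]) for a, b in groups)
        if score < best_score:
            best, best_score = groups, score
    return [letters[a] if b - a == 1 else letters[a] + "-" + letters[b - 1]
            for a, b in best]

print(best_split(6))   # e.g. ['A-D', 'E-H', ...] depending on the counts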
Then again, it depends on what a typical user will be looking for. I can't tell what that might be just from your description; are they just navigating a directory tree?
I almost always use the last option, since it is by far the easiest to navigate for a user. Use that if you have enough room for it, and one of the others if you have a limited amount of screen real estate.