calculating the density of a set - mysql

(I wish my mathematical vocabulary were more developed.)
I have a website. On that website is a video. As a user watches the video, a bit of JavaScript records how far into it they have gotten. When they stop watching, that number of seconds is stored. Unfortunately, there's no pattern to when the JS will do this.
So if one person is watching the video, we might see this set:
3
6
8
10
12
16
And another person might get bored immediately:
1
3
This data is all stored in the same place, anonymously. So the sorted table with all this info would look like this:
1
3
3
6
8
10
12
16
Finally, the amount of times the video is started at all is stored. In this case it would be 2.
So. How do I get the average 'high-time' (the farthest reached point in the video) for all of the times the video was played?
I know that if we had a value for every second:
1
2
3
4
5
6
7
...
14
15
16
1
2
3
Then we could count up the values and divide by the number of plays:
(19) / 2 = 9.5
Or if the data was otherwise uniform, say in increments of 5, then we could count that up and multiply it by 5 (in the example, we would have some loss of precision, but that's ok):
5
10
15
5
(4) * 5 / 2 = 10
So it seems like I have a general function which would work:
avg = count * (1/d) / plays
where d is the density of the numbers (in the example above with 5-second increments, 1/5) and plays is the number of plays.
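Making the plays divisor explicit, both worked examples above check out:
count, plays = 19, 2
d = 1.0              # one sample every second
print(count * (1 / d) / plays)   # 9.5

count, d = 4, 1 / 5  # one sample every five seconds
print(count * (1 / d) / plays)   # 10.0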
Is there a way to derive the density, d, from a set of effectively random numbers?

Why not just keep the last time that each viewer provided, and average across those? If you pay attention only to the last number from each sequence, it seems like you could just average over those.
You might also want to check out the term standard deviation as the raw average of this might not be the most useful measurement. If you have the standard deviation as well, it could help you realize that you have an average of 7, but it is composed of mostly 1's and 15's.
If you HAVE to have all the data, like you suggested, I will try and think about this a little bit more. I'm not totally certain how you can associate a value with all the previous values that came with it. Do you ALWAYS know the sequence by which numbers are output? If so, I think I know of a way you could derive the 'last' one, which might be slightly computationally expensive.
If you only have a sequence of integers, I think you may be able to increase each value (exponentially?) to 'compensate' for the fact that a later value 'contains' earlier values. I'm still working through this idea, but maybe it will give someone else a seed. What if you average over the sum of these, and then take the base2 logarithm of this average? Does that provide any kind of useful metric? That should 'weight' the later values to the point where they compensate for the sum of earlier values. I think.
In Python-esque (fixed so it runs: `**` squares, since `^` is XOR in Python, and log needs importing):
from math import log

total = 0    # renamed from `sum` to avoid shadowing the built-in
count = 0
for node in nodes:    # `nodes` is the stored progress reports
    total = total + node.value ** 2
    count = count + 1
weightedAverage = log(total / count, 2)
print(weightedAverage)
print("Thanks Brian")

I think that #brian-stiner is on the right track in one of his comments.
Start with something like:
1
3
3
6
8
10
12
16
Turn that into numbers and counts.
1, 1
3, 2
6, 1
8, 1
10, 1
12, 1
16, 1
And then, reading from the largest value down, keep each point that happened more often than any of the points after it.
3, 2
16, 1
Take differences in counts.
3, 1
16, 1
And you have an estimate of stopping places.
This will not be an unbiased estimate. But if the JavaScript is independently inconsistent and the number of people is large, the biases should be fairly small.
It won't be right, but it will be close enough for government work.
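Here's a short Python sketch of that procedure (the naming is mine; it tallies the values, scans from the top, and diffs the counts):

from collections import Counter

def estimate_stops(times):
    # Tally the pooled values: [(1, 1), (3, 2), (6, 1), ..., (16, 1)]
    counts = sorted(Counter(times).items())
    # Reading from the largest value down, keep each point that occurred
    # more often than any of the points after it.
    kept, best = [], 0
    for value, count in reversed(counts):
        if count > best:
            kept.append((value, count))
            best = count
    # Differences in counts estimate how many viewers stopped at each point.
    stops, prev = [], 0
    for value, count in kept:  # kept is largest-value-first: [(16, 1), (3, 2)]
        stops.append((value, count - prev))
        prev = count
    return stops

print(estimate_stops([1, 3, 3, 6, 8, 10, 12, 16]))  # [(16, 1), (3, 1)]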

Assuming increments are always around 5 seconds, with some missing and some a bit longer or shorter, it won't be easy (possible?) to do this exactly. My suggestion: compute something like a 'moving count', similar to a moving average.
So, for second 7: count how many numbers are 5, 6, 7, 8 or 9 and divide by 5. That will give you a pretty good guess of how many people watched the 7th second. Do the same for second 10. The difference will be close to the number of people who left between seconds 7 and 10.
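As a sketch in Python (the function and its window parameter are my naming, following the description above):

def moving_count(times, second, window=5):
    # Count reports within a `window`-second span centered on `second`,
    # then divide by the window width.
    half = window // 2
    return sum(1 for t in times if second - half <= t <= second + half) / window

times = [1, 3, 3, 6, 8, 10, 12, 16]
print(moving_count(times, 7))   # seconds 5..9 contain 6 and 8     -> 0.4
print(moving_count(times, 10))  # seconds 8..12 contain 8, 10, 12  -> 0.6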

To get the total time watched for each user, you'll have to parse the list from smallest to largest. If you have 4 views, you'll go through your list until you no longer have 4 identical numbers; the last number where you had 4 identical numbers is the maximum of the first view. Then you'll look for where the 3 identical numbers stop, and so on. For example:
4 views data:
1111222233334445566778
4 views side by side:
1 1 1 1
2 2 2 2
3 3 3 3 <- first view max is 3 seconds
4 4 4 <- second view max is 4 seconds
5 5
6 6
7 7 <- third view max is 7 seconds
8 <- fourth view max is 8 seconds
EDIT- Oh, I just noticed that they are not uniform. In that case, the moving average would probably be your best bet.
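For the uniform case (every view reporting once per second), here is a sketch of that column-peeling in Python (names are mine):

from collections import Counter

def view_maxima(times, views):
    counts = Counter(times)
    maxima = []
    for needed in range(views, 0, -1):
        # The largest second reported by at least `needed` viewers is
        # where the next viewer dropped off.
        maxima.append(max(t for t, c in counts.items() if c >= needed))
    return maxima

pooled = [int(c) for c in "1111222233334445566778"]
print(view_maxima(pooled, 4))  # [3, 4, 7, 8]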

The number of values roughly corresponds to the number of time periods in which your JavaScript sends the values (minus 1/2 if the video stop is accompanied by an obligatory time posting, since its moment is random within the interval).
If all clients have similar intervals and you know them, you may just use:
SELECT (COUNT(*) / (SELECT counter FROM countertable) - 0.5) * 5.0
FROM ticktable
5.0 is the interval between the posts here; subtracting 0.5 removes the expected half-interval from each play's average tick count, per the note above.
Note that it does not even look at the values: you could as well just store "ticks".

For the max time, you could use MAX() on your field. Perhaps something like...
SELECT MAX(play_time) AS maxTime FROM video
Which would give you the longest time someone has played the video for.
If you want other things, like AVG(), then you'll need more complex queries, collecting on a per-user basis, etc.
MySQL also has standard deviation functions, STDDEV() and STD(), which could help you too.

Related

Non Scaled SSRS Line Chart with multiple series

I am trying to present time series of multiple sensors on a single SSRS (v14) line chart
I need to plot N series, each independently scaled to the space provided by the chart (an independent vertical axis per series)
More about the data
There can be anywhere from ~1-10 series
The challenge is that they are different orders of magnitude.
One might be degrees F (~0-212)
One might be Carbon ppm (~1-16)
One might be Ftlbs Thrust (~10k-100k)
The point is, they have no relation and can be very different.
The exact value is not important; I can hide the vertical axis.
More about what I am trying to do
The idea is to show the multiple time series, plotted together against time for the 4 hours before and after
'an event'. It's not necessarily the exact value that is important; the subject matter expert would be looking for something odd (temperature falls, thrust spikes, etc.).
Things I have tried
If there were just 2 series, I could easily use the 2nd axis available in the SSRS chart. That's exactly the idea I am chasing, but in this case I want each of the N series to plot using its own axis.
I have tried stacking N transparent graphs on top of each other. This would be a really ugly solution, but SSRS won't even let you do it; it unstacks them for you.
I have experimented with the Allow Scale Breaks property on the Vert Axis. This would solve the problem, but we don't like the 'double jagged line'.
Turning on Logarithmic scale is a possibility. It does a better job of displaying all the data, but it's not really what we want: it's going to change the shape of data that ranges over a couple of orders of magnitude.
I tried the sparkline component and am having the same problem.
This approach is essentially the same as Greg's answer above. I've had to do this same process in the past, comparing trends of data even though the units were dissimilar.
I took a very simple approach of adding an additional column to the query that showed each value as a percentage of the maximum value in each series.
As an example (just 2 series here for clarity) I started with data like this in myTable
Series  Month  myValue
A       Jan    4
A       Feb    8
A       Mar    16
B       Jan    200
B       Feb    300
B       Mar    400
My dataset query would be something like:
SELECT *, myValue / MAX(myValue) OVER(PARTITION BY Series) as myPlotValue FROM myTable
This gives us a final dataset which looks like this:
Series  Month  myValue  myPlotValue
A       Jan    4        0.25
A       Feb    8        0.5
A       Mar    16       1
B       Jan    200      0.5
B       Feb    300      0.75
B       Mar    400      1
As you can see all plot values are now between 0 and 1.
I created the charts using the myPlotValue field and had the option of using the original values from the myValue field as data point labels.
After talking to some math people, I learned that this is a standard problem, solved by a process called normalization of the data.
Essentially you are changing all the series to fit in a given range (usually 0-1)
You can scale and add an offset if that makes sense for your problem domain somehow.
https://www.statisticshowto.datasciencecentral.com/normalized/
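As a sketch, here is the textbook min-max form in Python (note the SQL above divides by the series maximum only, which is a slightly different normalization):

def normalize(values, lo=0.0, hi=1.0):
    # Min-max normalization: rescale a series into [lo, hi].
    mn, mx = min(values), max(values)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in values]

print(normalize([200, 300, 400]))  # [0.0, 0.5, 1.0]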

Difference between quantile results and iqr

I'm trying to understand a little more about how Octave calculates quartiles and interquartile range. Consider the following:
A=[1 4 7 10 14];
quantile(A, [0.25 0.75])
ans = 3.2500 11.0000
This result seems consistent with Method 3 on the Wikipedia page about quartiles. Given that the interquartile range is Q3-Q1, I'd expect the result to be 7.75.
However, running iqr(A) gives a result of 6. Clearly this is calculated from 10 minus 4 from the original data, which is consistent with Method 2 from the same Wikipedia page.
What is the reason for using two different methods for calculating Q1 and Q3?
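For what it's worth, both results can be reproduced in NumPy by switching interpolation methods (assuming NumPy >= 1.22 for the `method` keyword):

import numpy as np

A = np.array([1, 4, 7, 10, 14])

# Hazen's rule reproduces Octave's quantile() output for this input.
q1, q3 = np.quantile(A, [0.25, 0.75], method='hazen')
print(q1, q3, q3 - q1)  # 3.25 11.0 7.75

# The default linear interpolation lands exactly on the data points 4 and 10,
# matching what Octave's iqr() returns here.
q1, q3 = np.quantile(A, [0.25, 0.75])
print(q1, q3, q3 - q1)  # 4.0 10.0 6.0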

Reverse Check digit algorithm

I am trying to reverse engineer an algorithm used to generate a check digit.
Numbers are 8 digits long and the last digit is the check digit. I have thousands of valid numbers to test it on.
I have tried several standard algorithms but came up with nothing.
Here are some examples of valid numbers:
3482145 6
3482146 4
3482147 2
3482148 3
3482149 9
3482150 1
3482151 0
3482152 8
3482153 6
3482154 4
3482155 2
3482156 3
3482157 9
3482158 7
3482159 5
3482160 8
3482161 6
Is it possible to calculate this? Any ideas?
The amount of data you provided is insufficient to adequately assess the algorithm. The only thing I can see right now is that the check-digit sequence 64239xx8 is repeated twice, and the digit after that is again 6.
Not an actual answer, I'm afraid, but StackOverflow does not yet allow me to leave comments.
The algorithm is this:
coef[]={4,2,1,6,3,7,9}
modulus 11
Case 10->0
Case 0->3
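That rule reproduces every sample in the question. A quick Python sketch of it:

COEFFS = [4, 2, 1, 6, 3, 7, 9]

def check_digit(first7):
    # Weighted sum of the seven digits, mod 11, then map 10 -> 0 and 0 -> 3.
    digits = [int(c) for c in str(first7).zfill(7)]
    r = sum(c * d for c, d in zip(COEFFS, digits)) % 11
    return {10: 0, 0: 3}.get(r, r)

# Spot checks against the question's data:
for number, expected in [(3482145, 6), (3482148, 3), (3482151, 0), (3482161, 6)]:
    assert check_digit(number) == expected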

Sorting by preferred average value

I'm making a game that involves people downloading and rating user-created maps. They have the option to upvote/downvote a map if they like/dislike it, as well as rate its difficulty on a scale of 1-10.
In the map browser, they have the option to sort maps by highest rated. This is done using Laplace smoothing so that it factors both the number of upvotes as well as the total number of ratings into the sorting, sorting by (upvotes + 1)/(numRatings + 2). This works fine.
Now, there's also an option to sort by preferred difficulty, where people can choose a value from 1-10, and it will sort maps by how close the average difficulty rating is to the preferred rating. At first I was sorting by ABS(preferred_difficulty - average_difficulty), but that didn't factor in the number of ratings. Right now I'm using ((numRatings + 1) * (10 - ABS(average_difficulty - preferred_difficulty)) + 1) / (numRatings + 1.5) out of sheer trial and error, which kinda works, but sometimes the number of ratings outweighs the preferred difficulty and the results look strange.
This is what I need help with - I can't figure out how to sort by smallest difference between preferred and average difficulty while incorporating number of ratings into the mix, since I want a low difficulty delta with a high rating count to be the best result instead of a high upvote count with a high rating count, like it is with the ratings.
For example, if this is the data:
AvgDifficulty  NumRatings
6.0            1
4.0            25
6.8            4
6.2            3
6.5            20
6.2            1
6.4            3
And someone chooses a preferred difficulty of 6.4, I'd want it to sort something like this:
AvgDifficulty  NumRatings
6.5            20
6.4            3
6.2            3
6.8            4
6.2            1
6.0            1
4.0            25
Basically I want results that are close to the preferred difficulty at the top, but I'd rather show results that are 0.1 off with lots of ratings over exact matches with very few ratings. I understand getting "right" results may not be very concrete in this case, I'm just looking for a starting point.
Thanks for the help!
You have a bit of a nebulous question. You need some way to combine the rating and the count. The basic query is:
order by abs(avg_difficulty - 6.4)
But, you want to include the count as well. You can define a fixed band at the top and order these by the ranking:
order by (case when abs(avg_difficulty - 6.4) < 0.1 then numratings end) desc,
abs(avg_difficulty - 6.4)
This expression combines everything from 6.3 to 6.5 in one group and sorts those by the number of ratings. The second key sorts everything else by the difference.
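Translated to a sort key in Python (a sketch; the 0.1 band is the answer's choice, and I've used <= so that a series exactly 0.1 away, like 6.5 here, lands inside the band):

rows = [(6.0, 1), (4.0, 25), (6.8, 4), (6.2, 3), (6.5, 20), (6.2, 1), (6.4, 3)]
preferred = 6.4

def sort_key(row):
    avg, num_ratings = row
    if abs(avg - preferred) <= 0.1:
        # Inside the band: order by rating count, descending.
        return (0, -num_ratings)
    # Outside the band: order by distance from the preferred difficulty.
    return (1, abs(avg - preferred))

for avg, num_ratings in sorted(rows, key=sort_key):
    print(avg, num_ratings)
# 6.5 20 / 6.4 3 / 6.2 3 / 6.2 1 / 6.8 4 / 6.0 1 / 4.0 25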

Real world examples of a kind of "sorted" data

Consider a sorted list of numbers which is "cut," so that it is increasing except for one jump. For instance the order might be,
11, 12, 13, 14, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
What kinds of data naturally have this representation, with one or possibly many "cuts" obscuring the default ordering? The only one I can think of is a deck of cards, but I was asked to produce examples of data that might look like this in an interview. Weeks later, and I still can't think of any, but my curiosity prevails.
Is there a special name for this kind of data? I tried googling "cut data" but that obviously didn't work.
All insight is appreciated.
[Edit] From the discussions below this appears to have some interesting relationships with symmetry groups, and what sorts of rearrangements are possible with just the cut operation. I may have to ask my local mathematicians what I can do with this.
I can think of a few off the top of my head.
The first is the hour of the day as it rolls into a new day: ... 22 23 0 1 2 ....
The second is the alpha ordering on file names: pax1 pax10 pax11 ... pax19 pax2 pax20 ....
Yet another is the months of the financial year (in Australia, most companies close off their financial year at the end of June): 7 8 9 10 11 12 1 2 3 4 5 6.
After a quick analysis, it's easy to see that any sequence of "cuts" results in a single cut with respect to a different index. In fact, only the most recent cut point matters: that value ends up at the front of the list, so the whole sequence is equivalent to one cut of the original data at that element's original index.
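A quick check in Python (the function name is mine):

def cut(xs, k):
    # One "cut": rotate the list so position k comes first.
    return xs[k:] + xs[:k]

xs = list(range(1, 15))
print(cut(xs, 10))  # [11, 12, 13, 14, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Two cuts in a row collapse to a single cut at the combined offset.
assert cut(cut(xs, 4), 6) == cut(xs, (4 + 6) % len(xs))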
So not so interesting.