I'm making a game that involves people downloading and rating user-created maps. They have the option to upvote/downvote a map if they like/dislike it, as well as rate its difficulty on a scale of 1-10.
In the map browser, they have the option to sort maps by highest rated. This is done using Laplace smoothing, so that the sort factors in both the number of upvotes and the total number of ratings: maps are ordered by (upvotes + 1) / (numRatings + 2). This works fine.
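In SQL, that sort is essentially (a minimal sketch; the maps table and the upvotes / num_ratings column names are just placeholders for illustration):
SELECT map_id,
       (upvotes + 1) / (num_ratings + 2) AS smoothed_score
FROM maps
ORDER BY smoothed_score DESC;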
Now, there's also an option to sort by preferred difficulty: people choose a value from 1-10, and maps are sorted by how close their average difficulty rating is to that preferred value. At first I sorted by ABS(preferred_difficulty - average_difficulty), but that didn't factor in the number of ratings. Right now, out of sheer trial and error, I'm using ((numRatings + 1) * (10 - ABS(average_difficulty - preferred_difficulty)) + 1) / (numRatings + 1.5), which kinda works, but sometimes the number of ratings outweighs the preferred difficulty and the results look strange.
This is what I need help with: I can't figure out how to sort by the smallest difference between preferred and average difficulty while incorporating the number of ratings into the mix. I want a low difficulty delta with a high rating count to be the best result, rather than having a high rating count dominate the way a high upvote count does in the rating sort.
For example, if this is the data:
AvgDifficulty NumRatings
6.0 1
4.0 25
6.8 4
6.2 3
6.5 20
6.2 1
6.4 3
And someone chooses a preferred difficulty of 6.4, I'd want it to sort something like this:
AvgDifficulty NumRatings
6.5 20
6.4 3
6.2 3
6.8 4
6.2 1
6.0 1
4.0 25
Basically, I want results that are close to the preferred difficulty at the top, but I'd rather show results that are 0.1 off with lots of ratings over exact matches with very few ratings. I understand that what counts as the "right" results may not be very concrete in this case; I'm just looking for a starting point.
Thanks for the help!
You have a bit of a nebulous question. You need some way to combine the rating and the count. The basic query is:
order by abs(avg_difficulty - 6.4)
But, you want to include the count as well. You can define a fixed band at the top and order these by the ranking:
order by (case when abs(avg_difficulty - 6.4) <= 0.1 then numratings end) desc,
         abs(avg_difficulty - 6.4)
The first key puts everything from 6.3 to 6.5 into one group and sorts that group by the number of ratings. The second key sorts everything else by the difference.
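As a runnable sketch of that banded sort (assuming a hypothetical maps table with avg_difficulty and num_ratings columns, and 6.4 as the chosen preferred difficulty):
SELECT map_id, avg_difficulty, num_ratings
FROM maps
ORDER BY (CASE WHEN ABS(avg_difficulty - 6.4) <= 0.1 THEN num_ratings END) DESC,
         ABS(avg_difficulty - 6.4);
Rows outside the band get NULL for the first key, which sorts after the banded rows under DESC, so they fall through to the plain distance sort.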
Related
I'm trying to understand a little more about how Octave calculates quartiles and interquartile range. Consider the following:
A=[1 4 7 10 14];
quantile(A, [0.25 0.75])
ans = 3.2500 11.0000
This result seems consistent with Method 3 on the Wikipedia page about quartiles. Given that the interquartile range is Q3-Q1, I'd expect the result to be 7.75.
However, running iqr(A) gives a result of 6. Clearly this is calculated as 10 minus 4 from the original data, which is consistent with Method 2 from the same Wikipedia page.
What is the reason for using two different methods for calculating Q1 and Q3?
I have a MySQL database I'm searching through. Let's say it's a database of people. When querying for a specific record, it's possible to match 100% on each attribute, but the real strategy is querying the database for the closest match (the closest matches across the table attributes).
In this scenario, does it make sense to create a temporary table (much like a tally sheet) to indicate which attributes match / which attributes are present? What is the typical approach to doing advanced searches on a database like this?
Example (below) of a hypothetical stored procedure.
(The parameters are just to exemplify how I would search. I'm not concerned with how to perform my selects; the question is about approach, strategy, technique.)
call FindPerson ("Brown Eyes", "Brown hair", "Height:6'1", "white", "Name:Joe", "weight180", "Age 34", "sex m");
RESULT TABLE
NAME AGE HEIGHT WEIGHT HAIR SKIN sex RANK_MATCH
Joe 32 6'1 180 Brown white m 1
Mike 33 6'1 179 Brown white m 2
James 31 6'0 179 Brown black m 3
Just off the top of my head: you can create your own score and sort by it. Something like
SELECT `id`,
(IF(`age`=32,1,0)+IF(`height`="6'1",1,0)+...) as `score`
FROM `people`
HAVING `score` > 0
ORDER BY `score` DESC
LIMIT 10;
With this, you can handle every field with its own comparison, and also weight the individual attributes by adding not just 1 but 2 or more.
But I'm not quite sure how performant this is.
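For example, a hypothetical weighted variant (column names and values are just illustrative) where name and age count double:
SELECT `id`,
       (2 * IF(`name` = "Joe", 1, 0) +
        2 * IF(`age` = 32, 1, 0) +
        IF(`height` = "6'1", 1, 0) +
        IF(`hair` = "Brown", 1, 0)) AS `score`
FROM `people`
HAVING `score` > 0
ORDER BY `score` DESC
LIMIT 10;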
The approach I would use would be to create a scoring function (your stored proc) that would evaluate the given input's standard distance from the mean.
In the proc, you would judge each criteria in a fashion similar to:
INPUT AGE: 32
calculate MEAN of AGE WHERE (sex = m): 34.5
calculate STANDARD DEVIATION of AGE WHERE (sex = m): 2.5
calculate how many STDEVs 32 is from the 34.5 (also known as z-score): 1
Repeat this process for all numeric datatypes, summing them and ORDER BY the sum.
In doing so, the following schema change would be required: height changed from foot/inch form to strictly inches.
Depending on your needs, you may also consider coming up with an arbitrary scale for sex and skin/hair color. Of course, you may think that measures like these should NOT be factored in because of how drastically they would change the scoring function. If you chose to include them, you'd have to find some number to add to the above SUM... but it's hard, because nominal variables don't translate easily into these kinds of things.
If you find that hair/skin color can usefully be mapped onto, say, a continuous color spectrum, your scoring tidbit would be the same: color value of the input vs. color value of the means and standard deviations.
The query that would find your matches would be something to the effect of:
SELECT
    ABS(INPUT_AGE - s.avg_age) / s.std_age AS age_z,
    ABS(INPUT_WT - s.avg_wt) / s.std_wt AS wt_z,
    ...
FROM `table`
-- the derived table s holds the population means and standard deviations
CROSS JOIN (
    SELECT AVG(AGE) AS avg_age, STD(AGE) AS std_age,
           AVG(WT) AS avg_wt, STD(WT) AS std_wt
    FROM `table`
) s
ORDER BY age_z + wt_z + ... ASC
(I wish my mathematical vocabulary was more developed)
I have a website. On that website is a video. As a user watches the video, a bit of javascript stores how far they have gotten so far in the video. When they stop watching the video, that number of seconds is stored. There's no pattern to when the js will do this, unfortunately.
So if one person is watching the video, we might see this set:
3
6
8
10
12
16
And another person might get bored immediately:
1
3
This data is all stored in the same place, anonymously. So the sorted table with all this info would look like this:
1
3
3
6
8
10
12
16
Finally, the number of times the video has been started at all is stored. In this case it would be 2.
So: how do I get the average 'high-time' (the farthest point reached in the video) across all of the times the video was played?
I know that if we had a value for every second:
1
2
3
4
5
6
7
...
14
15
16
1
2
3
Then we could count up the values and divide by the number of plays:
(19) / 2 = 9.5
Or if the data was otherwise uniform, say in increments of 5, then we could count that up and multiply it by 5 (in the example, we would have some loss of precision, but that's ok):
5
10
15
5
(4) * 5 / 2 = 10
So it seems like I have a general function which would work:
count * (1/d) / plays = avg
where d is the density of the numbers (in the example above with 5-second increments, 1/5) and plays is the number of times the video was started.
Is there a way to derive the density, d, from a set of effectively random numbers?
Why not just keep the last time that has been provided, and average across those? If you either throw away, or only pay attention to, the last number, it seems like you could just average over these.
You might also want to check out the term standard deviation as the raw average of this might not be the most useful measurement. If you have the standard deviation as well, it could help you realize that you have an average of 7, but it is composed of mostly 1's and 15's.
If you HAVE to have all the data, like you suggested, I will try and think about this a little bit more. I'm not totally certain how you can associate a value with all the previous values that came with it. Do you ALWAYS know the sequence by which numbers are output? If so, I think I know of a way you could derive the 'last' one, which might be slightly computationally expensive.
If you only have a sequence of integers, I think you may be able to increase each value (exponentially?) to 'compensate' for the fact that a later value 'contains' earlier values. I'm still working through this idea, but maybe it will give someone else a seed. What if you average over the sum of these, and then take the base2 logarithm of this average? Does that provide any kind of useful metric? That should 'weight' the later values to the point where they compensate for the sum of earlier values. I think.
In Python-esque:
from math import log
total = 0
number_of = 0
for node in nodes:
    total = total + node.value ** 2  # square (**) each value, not XOR (^), to weight later values more heavily
    number_of = number_of + 1
weighted_average = log(total / number_of, 2)
print(weighted_average)
print("Thanks Brian")
I think that #brian-stiner is on the right track in one of his comments.
Start with something like:
1
3
3
6
8
10
12
16
Turn that into numbers and counts.
1, 1
3, 2
6, 1
8, 1
10, 1
12, 1
16, 1
And then reading from the end down, find all of the points that happened more often than any remaining ones.
3, 2
16, 1
Take differences in counts.
3, 1
16, 1
And you have an estimate of stopping places.
This will not be an unbiased estimate. But if the JavaScript is independently inconsistent and the number of people is large, the biases should be fairly small.
It won't be right, but it will be close enough for government work.
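If you wanted the counting and differencing above done in the database, a sketch (assuming MySQL 8+ and a hypothetical ticks table with a seconds column) might be:
WITH counts AS (
    SELECT seconds, COUNT(*) AS cnt
    FROM ticks
    GROUP BY seconds
),
running AS (
    SELECT seconds, cnt,
           MAX(cnt) OVER (ORDER BY seconds DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS max_later
    FROM counts
)
SELECT seconds,
       cnt - COALESCE(max_later, 0) AS estimated_stops  -- viewers estimated to have stopped here
FROM running
WHERE cnt > COALESCE(max_later, 0)
ORDER BY seconds;
On the sample data this keeps (3, 2) and (16, 1) and reports one estimated stop at each, matching the hand calculation above.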
Assuming increments are always around 5, with some missing and some a bit longer or shorter, it won't be easy (possible?) to do this exactly. My suggestion: compute something like a 'moving count', similar to a moving average.
So, for second 7: count how many numbers are 5, 6, 7, 8 or 9 and divide by 5. That will give you a pretty good guess of how many people watched the 7th second. Do the same for second 10. The difference would be close to the number of people who left between second 7 and 10.
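A sketch of that single-second 'moving count' in SQL (the ticks table and seconds column are hypothetical names for wherever the raw numbers are stored):
SELECT COUNT(*) / 5 AS approx_viewers_at_second_7
FROM ticks
WHERE seconds BETWEEN 5 AND 9;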
To get the total time watched for each user, you'll have to parse the list smallest to largest. If you have 4 views, you'll go through your list until you find that you no longer have 4 identical numbers; the last number where you had 4 identical numbers is the maximum of the first view. Then you'll look for when the 3 identical numbers stop, and so on. For example:
4 views data:
1111222233334445566778
4 views side by side:
1 1 1 1
2 2 2 2
3 3 3 3 <- first view max is 3 seconds
4 4 4 <- second view max is 4 seconds
5 5
6 6
7 7 <- third view max is 7 seconds
8 <- fourth view max is 8 seconds
EDIT- Oh, I just noticed that they are not uniform. In that case, the moving average would probably be your best bet.
The number of values roughly corresponds to the number of time periods in which your JavaScript sends the values (minus 1/2 if the video stop is accompanied by an obligatory time posting, since its moment is random within the interval).
If all clients have similar intervals and you know them, you may just use:
SELECT (COUNT(*) - 0.5) * 5.0 / (SELECT counter FROM countertable)
FROM ticktable
5.0 is the interval between the posts here.
Note that it does not even look at the values: you could just as well store "ticks".
For the max time, you could use MAX() on your field. Perhaps something like...
SELECT MAX(play_time) AS maxTime FROM video
Which would give you the longest time someone has played the video for.
If you want other things, like AVG(), then you'll need more complex queries, e.g. collecting on a per-user basis and so on.
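For instance, if each row also carried a (hypothetical) session_id, the average furthest point reached could be sketched as:
SELECT AVG(max_time) AS avg_high_time
FROM (
    SELECT session_id, MAX(play_time) AS max_time
    FROM video
    GROUP BY session_id
) per_session;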
MySQL also has standard deviation functions, STDDEV() and STD(), which could help you too.
I have to track the stock of individual parts and kits (assemblies) and can't find a satisfactory way of doing this.
Sample bogus and hyper simplified database:
Table prod:
prodID 1
prodName Flux capacitor
prodCost 900
prodPrice 1350 (900*1.5)
prodStock 3
-
prodID 2
prodName Mr Fusion
prodCost 300
prodPrice 600 (300*2)
prodStock 2
-
prodID 3
prodName Time travel kit
prodCost 1200 (900+300)
prodPrice 1560 (1200*1.3)
prodStock 2
Table rels
relID 1
relSrc 1 (Flux capacitor)
relType 4 (is a subpart of)
relDst 3 (Time travel kit)
-
relID 2
relSrc 2 (Mr Fusion)
relType 4 (is a subpart of)
relDst 3 (Time travel kit)
prodPrice: it's calculated from the cost, but not linearly. In this example, for costs of 500 or less the price is 200% of the cost; for costs of 500-1000 it's 150%; for costs over 1000 it's 130%.
That's why the time travel kit is much cheaper than the sum of its individual parts.
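That tiered pricing can be written as a single CASE over the cost; a sketch using the thresholds from the example above (the exact boundary handling is an assumption):
SELECT prodID,
       prodCost,
       CASE
           WHEN prodCost <= 500  THEN prodCost * 2.0
           WHEN prodCost <= 1000 THEN prodCost * 1.5
           ELSE prodCost * 1.3
       END AS prodPrice
FROM prod;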
prodStock: here is my problem. I can sell kits or the individual parts, so the stock of the kits is virtual.
The problem when I buy:
Some providers sell me the Time travel kit as a whole (with one barcode) and some sell me the individual parts (with different barcodes).
So when I load the stock, I don't know how to record it.
The problem when I sell:
If I only sold kits, calculating the stock would be easy: "I have 3 Flux capacitors and 2 Mr Fusions, so I have 2 Time travel kits and a spare Flux capacitor."
But I can sell kits or individual parts, so I have to track the stock of the individual parts and the possible kits at the same time (and I have to compensate for the selling price).
Probably this is really simple, but I can't see a simple solution.
To sum up: I have to find a way of tracking the stock, and the database/program has to be the one to do it (I can't ask the clerk to correct the stock).
I'm using PHP + MySQL, but this is more a logic problem than a programming one.
Update: Sadly, Eagle's solution won't work:
The relationships can be (and are) recursive (one kit uses another kit).
There are kits that use more than one of the same part (2 Flux capacitors + 1 Mr Fusion).
I really need to store a value for the stock of the kit. The same database is used for the web page where users buy the parts, and I have to show the available stock (otherwise they won't even try to buy). I can't afford to calculate the stock on every user search on the web page.
But I liked the idea of a boolean marking the stock as virtual
Okay, well first of all: since the prodStock for the Time travel kit is virtual, you cannot store it in the database; it will essentially be a calculated field. It would probably help if you had a boolean on the table which says whether the prodStock is calculated or not. I'll pretend you had this field in the table and call it isKit for now (where TRUE implies it's a kit and the prodStock should be calculated).
Now to calculate the amount of each item that is in stock:
select p.prodID, p.prodName, p.prodCost, p.prodPrice, p.prodStock from prod p where not isKit
union all
select p.prodID, p.prodName, p.prodCost, p.prodPrice, min(c.prodStock) as prodStock
from
prod p
inner join rels r on (p.prodID = r.relDst and r.relType = 4)
inner join prod c on (r.relSrc = c.prodID and not c.isKit)
where p.isKit
group by p.prodID, p.prodName, p.prodCost, p.prodPrice
I used the alias c for the second prod to stand for 'component'. I explicitly wrote not c.isKit since this won't work recursively. union all is used rather than union for efficiency reasons, since they will both return the same results anyway.
Caveats:
This won't work recursively (e.g. if a kit requires components from another kit).
This only works on kits that require only one of a particular item (e.g. if a time travel kit were to require 2 flux capacitors and 1 Mr. Fusion, this wouldn't work).
I didn't test this so there may be minor syntax errors.
This only calculates the prodStock field; to do the other fields you would need similar logic.
If your query is much more complicated than what I assumed, I apologize, but I hope that this can help you find a solution that will work.
As for how to handle the data when you buy a kit, this assumes you would store the prodStock only in the component parts. So, for example, if you purchase a time machine from a supplier, instead of increasing the prodStock on the time machine product, you would increase it on the Flux capacitor and the Mr Fusion.
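To relax the "only one of each part" caveat, one untested option would be a hypothetical relQty column on rels holding how many units of each component a kit needs; the number of buildable kits is then the minimum over components of the stock integer-divided by the required quantity:
SELECT p.prodID, p.prodName, MIN(c.prodStock DIV r.relQty) AS prodStock
FROM prod p
INNER JOIN rels r ON (p.prodID = r.relDst AND r.relType = 4)
INNER JOIN prod c ON (r.relSrc = c.prodID AND NOT c.isKit)
WHERE p.isKit
GROUP BY p.prodID, p.prodName;
This still doesn't handle recursion, only the quantity limitation.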
This is for a new feature on http://cssfingerprint.com (see /about for general info).
The feature looks up the sites you've visited in a database of site demographics, and tries to guess what your demographic stats are based on that.
All my demographics are in 0..1 probability format, not ratios or absolute numbers or the like.
Essentially, you have a large number of data points that each tend you towards their own demographics. However, just taking the average is poor, because it means that by adding in a lot of generic data, the number goes down.
For example, suppose you've visited sites S0..S50. All except S0 are 48% female; S0 is 100% male. If I'm guessing your gender, I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) do not have an average of 50%. For example, the average probability of having kids 0-17 is ~37%. The more a given site's demographics differ from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
For extra credit: what's the best way to calculate this, that is also cheap & easy to do in mysql?
ETA: I think that something approximating what I want is Φ(AVG(z-score ^ 2, sign preserved)). But I'm not sure if this is a good weighting function.
(Φ is the standard normal distribution function - http://en.wikipedia.org/wiki/Standard_normal_distribution#Definition)
A good framework for these kinds of calculations is Bayesian inference. You have a prior distribution of the demographics, e.g. 50% male, 37% with kids 0-17, etc. Preferably you would have it multivariately (10% male, with kids 0-17, Caucasian, ...), but you can start with one variable at a time.
After this prior, each site contributes new information about the likelihood of a demographic category, and you get the posterior estimate which informs your final guess. Using some independence assumptions, the updating formula is as follows:
posterior odds = (prior odds) * (site likelihood ratio),
where odds = p/(1-p) and the likelihood ratio is a multiplier modifying the odds after visiting the site. There are various formulas for it, but in this case I would just use the above formula for the general population and the site's population to calculate it.
For example, for a site that has 35% of its visitors in the "under 20" agegroup, which represents 20% of the population, the site likelihood ratio would be
LR = (0.35/0.65) / (0.2/0.8) = 2.154
so visiting this site would raise the odds of being "under 20" 2.154-fold.
A site that is 100% male would have an infinite LR, but you would probably want to limit it somewhat by, say, using only 99.9% male. A site that is 50% male would have an LR of 1, so it would not contribute any information on gender distribution.
Suppose you start knowing nothing about a person - his or her odds of being "under 20" are 0.2/0.8 = 0.25. Suppose the first site has an LR=2.154 for this outcome - now the odds of being "under 20" becomes 0.25*(2.154) = 0.538 (corresponding to the probability of 35%). If the second site has the same LR, the posterior odds become 1.16, which is already 54%, etc. (probability = odds/(1+odds)). At the end you would pick the category with the highest posterior probability.
There are loads of caveats with these calculations - for example, the assumption of independence likely being wrong, but it can provide a good start.
The naive Bayesian formula for your case looks like this:
SELECT probability
FROM (
SELECT @apriori := CAST(@apriori * ratio / (@apriori * ratio + (1 - @apriori) * (1 - ratio)) AS DECIMAL(30, 30)) AS probability,
@step := @step + 1 AS step
FROM (
SELECT @apriori := 0.5,
@step := 0
) vars,
(
SELECT 0.99 AS ratio
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
UNION ALL
SELECT 0.48
) q
) q2
ORDER BY
step DESC
LIMIT 1
Quick 'n' dirty: get a male score by multiplying the male probabilities, and a female score by multiplying the female probabilities. Predict the larger. (Actually, don't multiply; sum the log of each probability instead.) I think this is a maximum likelihood estimator if you make the right (highly unrealistic) assumptions.
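A sketch of that log-likelihood comparison in SQL, assuming a hypothetical site_stats table holding one male_ratio row per site the user visited (ratios of exactly 0 or 1 would need clamping first, e.g. to 0.001/0.999):
SELECT SUM(LN(male_ratio))     AS male_score,
       SUM(LN(1 - male_ratio)) AS female_score
FROM site_stats;
Predict whichever score is larger.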
The standard formula for calculating the weighted mean is given in this question and this question.
I think you could look into these approaches and then work out how to calculate your weights.
In your gender example above, you could adopt something along the lines of a set of weights {1, ..., 0, ..., 1}: a linear decrease from 1 at 0% male down to 0 at 50%, and then a corresponding increase back up to 1 at 100%. If you want the effect to be skewed in favour of the outlying values, you can easily come up with an exponential or trigonometric function that provides a different set of weights. If you wanted, a normal distribution curve would also do the trick.
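As a rough SQL sketch of such a distance-from-50% weighting for the gender case (site_stats and male_ratio are hypothetical names; sites near 50/50 get a weight near 0, heavily skewed sites a weight near 1):
SELECT SUM(ABS(male_ratio - 0.5) * 2 * male_ratio)
     / SUM(ABS(male_ratio - 0.5) * 2) AS weighted_male_estimate
FROM site_stats;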