I want to evaluate the acceptance rate of a proposal, where a proposal can receive two types of votes: positive and negative.
So the simplest function that comes to mind is as follows:
p / (p + n + \epsilon)
However, I would like to come up with a more sophisticated function that satisfies the following two properties.
The ratio of positive votes to the total number of votes should always take precedence. So where p1 = 5, n1 = 0 and p2 = 99, n2 = 1, the function should calculate a higher acceptance rate for the first one.
When the ratios are equal, the function should return a higher acceptance rate for the one with the higher total number of votes. So in the following case, where p1 = 1000, n1 = 0 and p2 = 10, n2 = 0, again the first one should have a higher acceptance rate.
Another idea concerning the function could be the following:
w * [p / (p + n)] + (1 - w) * [(p + n) / maxV]
where maxV is the maximum number of votes that any proposal received and w is a real number in [0..1].
This function satisfies the second condition, whereas the guarantee does not extend to the first one. Finding a value for w that satisfies both properties could be cumbersome, so I'm searching for a better solution.
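For concreteness, here is a small Python sketch of the weighted function above, evaluated on the examples from the two properties; the values chosen for w and maxV here are arbitrary assumptions for illustration:

def acceptance(p, n, max_v, w=0.5):
    # weighted mix of the positive-vote ratio and the relative vote volume
    ratio = p / (p + n) if (p + n) > 0 else 0.0
    volume = (p + n) / max_v
    return w * ratio + (1 - w) * volume

max_v = 1000  # assume the most-voted proposal received 1000 votes
print(acceptance(5, 0, max_v))     # 0.5025
print(acceptance(99, 1, max_v))    # 0.545 -> outranks 5/0, so property 1 is violated
print(acceptance(1000, 0, max_v))  # 1.0
print(acceptance(10, 0, max_v))    # 0.505 -> property 2 holds (equal ratios, more votes wins)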
For the past week I have been trying to construct specific functions that would satisfy some requirements, sadly without success, so I decided to give Stack Overflow a shot.
I have three shapes in a 2D world as shown in the image: shape A with center a, shape B with center b, and shape C with center c. For this problem the shapes' form does not matter, so I will not expand on that. The other important variables are d_ab, which represents the minimum distance between shapes A and B, d_bc, the distance between B and C, and d_ac, the distance between A and C. Also, d_ab, d_bc, and d_ac range from 0 to 1, where 1 means the shapes are touching and 0 means they are farther apart than a specific threshold.
Here is the image of it
I am trying to find three new functions:
f_a(d_ab, d_bc, d_ac) = x
f_b(d_ab, d_bc, d_ac) = y
f_c(d_ab, d_bc, d_ac) = z
Firstly, for the inputs d_ab, d_bc, d_ac on the left side of the table below, these functions should produce the values x, y, z shown on the right side:
d_ab  d_bc  d_ac  ->  x          y          z
1     0     0     ->  (a+b)/2    (a+b)/2    c
0     1     0     ->  a          (b+c)/2    (b+c)/2
0     0     1     ->  (a+c)/2    b          (a+c)/2
1     1     1     ->  (a+b+c)/3  (a+b+c)/3  (a+b+c)/3
0     0     0     ->  a          b          c
Secondly, all other inputs d_ab, d_bc, d_ac (from 0 to 1) that are not described in the table should result in values interpolated between the closest table entries, ensuring continuity (by continuity I mean that a small change in the input should not produce a big difference in the output; there is probably a better way to describe this mathematically).
For example:
f_a(0.3, 0, 0) = x
Most likely it will combine equations from the table:
f_a(0, 0, 0) = a
f_a(1, 0, 0) = (a+b)/2
Resulting in:
f_a(0.3, 0, 0) = 0.7 * a + 0.3 * (a+b)/2
I am curious whether there is a method to find these kinds of functions. I assume there could be more than one that satisfies the requirements, but I am fine with any of them as long as it has the continuity I referred to above.
I am almost sure there is a better way to represent this problem mathematically, but I am not that great at it.
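For what it's worth, one continuous family of functions that happens to reproduce the whole table is a distance-weighted average of the centers; the particular formula in this Python sketch is just one possible choice, not the only valid answer:

def f_a(d_ab, d_bc, d_ac, a, b, c):
    # center a is pulled toward b and c in proportion to how close they are
    return (a + d_ab * b + d_ac * c) / (1 + d_ab + d_ac)

def f_b(d_ab, d_bc, d_ac, a, b, c):
    return (b + d_ab * a + d_bc * c) / (1 + d_ab + d_bc)

def f_c(d_ab, d_bc, d_ac, a, b, c):
    return (c + d_ac * a + d_bc * b) / (1 + d_ac + d_bc)

# spot-check two rows of the table with scalar centers a=0, b=3, c=9
# (with 2D centers, apply the same formula to each coordinate)
a, b, c = 0.0, 3.0, 9.0
print(f_a(1, 0, 0, a, b, c))  # 1.5 == (a+b)/2
print(f_a(1, 1, 1, a, b, c))  # 4.0 == (a+b+c)/3

Note that this choice is continuous and matches the table, but it does not reduce to the exact linear blend in the f_a(0.3, 0, 0) example above; it is only a sketch of one function with the required behaviour.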
I have word frequencies and I would like to convert number_of_occurrence to a number between 0 and 10.
word number_of_occurrence score
and 200 10
png 2 1
where 50 6
news 120 7
If you want to rate term frequencies in a corpus, I suggest you read this Wikipedia article: Term frequency–inverse document frequency.
There are many ways to count the term frequency.
I understood you want to rate it between 0 and 10.
I didn't get how you calculated your example score values.
Anyway, I suggest a usual method: the log function.
import math
from nltk.tokenize import word_tokenize as tokenize  # assumes nltk and its 'punkt' data are installed
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))

# count the occurrences of your terms
freq_table = {}
words = tokenize(sentence)
for word in words:
    word = word.lower()
    # stem the word if you can, using nltk
    if word in stopWords:  # do you really want to count the occurrences of 'and'?
        continue
    if word in freq_table:
        freq_table[word] += 1
    else:
        freq_table[word] = 1

# log-normalize the occurrences (write the new values back into the table)
for word, count in freq_table.items():
    freq_table[word] = 10 * math.log(1 + count)
Of course, instead of log normalization you can use normalization by the maximum:
# ratio-max normalize the occurrences
max_count = max(freq_table.values())
for word, count in freq_table.items():
    freq_table[word] = 10 * count / max_count
Or if you need a threshold effect, you can use a sigmoid function that you can customize:
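A minimal sketch of such a customizable sigmoid (the midpoint and steepness values are illustrative assumptions to be tuned to your corpus):

import math

def sigmoid_score(count, midpoint=25.0, steepness=0.2):
    # maps a raw occurrence count to a score in (0, 10);
    # midpoint is the count that scores 5, steepness controls how sharp the threshold is
    return 10.0 / (1.0 + math.exp(-steepness * (count - midpoint)))

print(sigmoid_score(2))    # close to 0
print(sigmoid_score(25))   # 5.0
print(sigmoid_score(200))  # close to 10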
For more word processing, check the Natural Language Toolkit. For a good term frequency count, stemming is a good choice (stopwords are also useful)!
The score is between 0 and 10. The maximum score is 10 for occurrence 50, therefore anything higher than that should also have score 10. On the other hand, the minimum score is 0, while the score is 1 for occurrence 5, so assume anything lower than that has score 0.
Interpolation is based on your given condition only:
If a word appear 50 times it should be closer to 10 and if a word
appear 5 times it should be closer to 1.
df['score'] = df['number_of_occurrence'].apply(lambda x: x/5 if 5 <= x <= 50 else (0 if x < 5 else 10))
Output:
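For reference, a minimal reproduction with the sample counts from the question (pandas assumed); the scores follow directly from the lambda above:

import pandas as pd

df = pd.DataFrame({
    'word': ['and', 'png', 'where', 'news'],
    'number_of_occurrence': [200, 2, 50, 120],
})
df['score'] = df['number_of_occurrence'].apply(lambda x: x/5 if 5 <= x <= 50 else (0 if x < 5 else 10))
print(df)
# 'and'   (200) -> 10.0
# 'png'   (2)   -> 0.0
# 'where' (50)  -> 10.0
# 'news'  (120) -> 10.0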
I have a VARCHAR field that stores a value like 0.00000000.
I want to run a report query to SUM all those VARCHAR fields, which means I have to convert them to a number to add them.
Here's my query, which works as far as giving no errors, but it gives the wrong number back:
SELECT SUM(CAST(IFNULL(tx.received_amount, '0.00000000') AS DECIMAL(16, 16)))
FROM account
JOIN account_invoice
ON account_invoice.account_id = account.id
JOIN withdrawal
ON withdrawal.invoice_id = account_invoice.invoice_id
JOIN tx
ON tx.id = withdrawal.tx_id
AND tx.currency = 'BTC'
AND tx.created_at > DATE_SUB(NOW(), INTERVAL 7 DAY)
WHERE account.id = 1
This is what I get: 100 x 1.12345678 = 100.00000000
This is what I should get: 100 x 1.12345678 = 112.34567800
Why is the SUM not adding the numbers after the decimal?
You are not using the DECIMAL datatype according to your use case. DECIMAL(16, 16) declares a decimal number with a total of 16 digits, of which 16 are decimal digits. This cannot hold a value greater than 1.
Consider:
SELECT CAST('1.12345678' AS DECIMAL(16, 16))
Returns: 0.9999999999999999.
You probably want something like DECIMAL(16, 8) instead, since your strings seem to have 8 decimals.
From the MySQL documentation:
The declaration syntax for a DECIMAL column is DECIMAL(M,D). The ranges of values for the arguments are as follows:
M is the maximum number of digits (the precision). It has a range of 1 to 65.
D is the number of digits to the right of the decimal point (the scale). It has a range of 0 to 30 and must be no larger than M.
GMB's answer is usually the best choice, but if you truly need to output (a_really_precise_number) * 100, you can do it application-side by passing the value as a string into a language that supports arbitrarily large numbers and casting it there. If you have numbers more precise than 16 digits in your database, you are likely already using such a language in your application.
In some cases, you are looking at data from another source and you have more precise numbers than your language of choice is designed for. Many languages that don't support these larger numbers natively have libraries that do fancy parsing to perform math on strings as strings, but they tend to be a bit slow if you need to work with really large numbers or data sets.
A third option, if you are just multiplying by a power of 10 such as N * 100 and outputting the result, is to pass the value to the application as a string and then parse it to move the decimal point over 2 places, like this:
function shiftDec(str, shift){
    // split on the decimal point
    var decPoint = str.indexOf(".");
    var decInt = str.substr(0, decPoint);
    var decMod = str.substr(decPoint + 1);
    // move the decimal 'shift' places to simulate N*100
    if(shift > 0){
        // pull the first 'shift' digits of the fractional part into the integer part
        var shiftCopy = decMod.substr(0, shift);
        decInt = decInt + shiftCopy;
        decMod = decMod.substr(shift);
    } else {
        // push the last 'shift' digits of the integer part into the fractional part
        var shiftCopy = decInt.substr(decInt.length + shift);
        decInt = decInt.substr(0, decInt.length + shift);
        decMod = shiftCopy + decMod;
    }
    return decInt + '.' + decMod;
}

var result = shiftDec("1234567891234567.8912345678912345", 2);
document.write(result); // 123456789123456789.12345678912345
You should not use DECIMAL(16,16):
SELECT 100 * CAST('1.123' AS DECIMAL(16,16))
Returns: 99.999...
SELECT 100 * CAST('1.123' AS DECIMAL(16, 10))
Returns: 112.300...
I am currently trying to figure out how to calculate the similarity between two records. My first record would be from a deactivated advertisement, and I want to find, e.g., the 10 most similar advertisements based on the equality of some VARCHAR fields.
The thing I can't figure out is whether there is any MySQL function that can help me here, or whether I need to compare the strings in some other way.
EDIT #1
Similarity would be defined by these fields:
Title (weight: 50 %)
Content (weight: 40 %)
Category (weight: 10 %)
EDIT #2
I want the calculation to be like this:
Title: words that match in the title field (only words > 2 letters are matched).
Description: words that match in the description field (only words > 2 letters are matched).
Category: match the category, and if that doesn't match, match the parent category with less weight :)
An equation of this could be:
#1 is the old, inactive post, #2 is the active post:
#2 title matches #1 title in 3 words out of #2's total of 10 words.
That gives 30 % match = 30 points.
#2 description matches #1 description in 10 words out of #2's total
of 400 words. That gives a 4 % match = 4 points.
#2 category doesn't match #1's category, therefore 0 % match. That
gives 0 points.
Then the sum would be 34 points for #2. :)
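As a rough illustration of that calculation, here is a small Python sketch; it follows the worked example above (the percentage of matching words per field counted directly as points), and the 10-point category bonus and the parent-category fallback are assumptions:

def percent_word_match(old_text, new_text):
    # fraction of the new post's words (only words longer than 2 letters)
    # that also appear in the old post, as a percentage
    old_words = {w.lower() for w in old_text.split() if len(w) > 2}
    new_words = [w.lower() for w in new_text.split() if len(w) > 2]
    if not new_words:
        return 0.0
    matches = sum(1 for w in new_words if w in old_words)
    return 100.0 * matches / len(new_words)

def similarity_points(old_ad, new_ad):
    points = percent_word_match(old_ad['title'], new_ad['title'])
    points += percent_word_match(old_ad['description'], new_ad['description'])
    if old_ad['category'] == new_ad['category']:
        points += 10  # assumed bonus; a parent-category match would add less
    return points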
EDIT #3
Here's my query, but it doesn't return different rows; it returns many copies of the same row.
SELECT
a.AdvertisementID as A_AdvertisementID,
IF(a.Topic LIKE a2.Topic, 50, 0) + IF(a.Description LIKE a2.Description, 40, 0) + IF(a.Cate_CategoryID LIKE a2.Cate_CategoryID, 10, 0) as A_Score,
a.AdvertisementID as A_AdvertisementID,
a.Topic as A_Topic,
LEFT(a.Description, 300) as A_Description,
a.Price as A_Price,
a.Type as A_Type
FROM
".DB_PREFIX."A_Advertisements a2,
".DB_PREFIX."A_Advertisements a
WHERE
a2.AdvertisementID <> a.AdvertisementID
AND
a.AdvertisementID = :a_id
ORDER BY
A_Score DESC
If you can literally compare the fields you are interested in, you could have MySQL perform a simple scoring calculation using the IF() function, for example
select
foo.id,
if (foo.title='wantedtitle', 50, 0) +
if (foo.content='wantedcontent', 40, 0) +
if (foo.category='wantedcategory', 10, 0) as score
from foo
order by score desc
limit 10
A basic 'find a fragment' score could be achieved using LIKE:
select
foo.id,
if (foo.title like '%wantedtitlefragment%', 50, 0) +
if (foo.content like '%wantedcontentfragment%', 40, 0) +
if (foo.category like '%wantedcategoryfragment%', 10, 0) as score
from foo
order by score desc
limit 10
There are other techniques, but they might be slow to implement in MySQL. For example, you could calculate the Levenshtein distance between two strings - see this post for an example implementation.
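If you end up doing that scoring application-side rather than in MySQL, here is a minimal Python sketch of the classic dynamic-programming Levenshtein distance, purely for reference (it is not the implementation from the linked post):

def levenshtein(s1, s2):
    # classic dynamic-programming edit distance between two strings
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3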
Please order the functions below by growth rate:
n ^ 1.5
n ^ 0.5 + log n
n log ^ 2 n
n log ( n ^ 2 )
n log log n
n ^ 2 + log n
n log n
n
ps:
Ordering by growth rate means, as n gets larger and larger, which function will eventually be higher in value than the others.
ps2. I have ordered most of the functions:
n , n log log n, n log n, n log^2 n, n log ( n ^ 2 ), n ^ 1.5
I just do not know how to order these two:
n ^ 2 + log n
n ^ 0.5 + log n
Can anyone help me?
Thank you
You can figure this out fairly easily by graphing the functions and seeing which ones get larger (find a graphing calculator, check out Maxima, or try graphing the functions on Wolfram Alpha). Or, of course, you can just pick some large value of n and compare the various functions, but graphs can give a bit of a better picture.
The key to the answer you seek is that when you sum two functions, their combined "growth rate" is going to be exactly that of the one with the higher growth rate of the two. So, you now know the growth rates of these two functions, since you appear (from knowing the correct ordering of all the others) to know the proper ordering of the growth rates that are in play here.
Plugging in a large number is not the correct way to approach this!
Since you have the orders of growth, you can use the following rules: http://faculty.ksu.edu.sa/Alsalih/CSC311_10_11_01/3.3_GrowthofFunctionsAndAsymptoticNotations.pdf
In all of those cases, you're dealing with pairs of functions that themselves have different growth rates.
With that in mind, only the larger one really matters, since it will be most dominant even with a sum. So in each of those function sums, which is the bigger one and how does it compare to the other ones on your larger list?
If you need to prove it mathematically, you should try something like this.
If you have two functions, e.g.:
f1(n) = n log n
f2(n) = n
You can simply find the limit of f3(n) = f1(n)/f2(n) when n tends to infinity.
If the result is zero, then f2(n) has a greater growth rate than f1(n).
On the other hand, if the result is infinity then f1(n) has a greater growth rate than f2(n).
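If you want to check such a limit mechanically, here is a small sketch using SymPy (the library choice is just an assumption) applied to the same pair f1 = n log n, f2 = n:

import sympy as sp

n = sp.symbols('n', positive=True)
f1 = n * sp.log(n)
f2 = n

# limit of f1/f2 as n goes to infinity
print(sp.limit(f1 / f2, n, sp.oo))  # oo, so n*log(n) has the greater growth rate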
n^0.5 (or n^(1/2)) is the square root of n. So, it grows more slowly than n^2.
Let's say n = 4; then we get (using log base 10):
n ^ 2 + log n = 16.6020599913
n ^ 1.5 = 8
n = 4
n log ( n ^ 2 ) = 4.81
n ^ 0.5 + log n = 2.60205999133
n log n = 2.4
n log ^ 2 n = 1.45
n log log n = -0.8