Determining the best result given two variables [closed] - mysql

I'm looking for a way to weight my results so that I get the "best" highest-rated result.
I have a table consisting of rating (0-5), mentions and name.
For example:
RATING MENTIONS NAME
2.5 15 Bob
4.4 14 Susan
1 60 John
5 2 Steve
Both mentions and rating are important so sorting by just rating won't get the desired results.
In this example, Steve has the highest rating but very few mentions, so I'm not very confident that he is the "best" highest-rated person. Susan has several mentions and a high rating, so she should surpass Steve. John has a very low rating but lots of mentions; he should only surpass any of the other people if he has a ridiculous number of mentions.
The ideal result would be something similar to:
RATING MENTIONS NAME
4.4 14 Susan
5 2 Steve
2.5 15 Bob
1 60 John
Appreciate the help!

The simplest way to do this is to score each row as:
RATING * RATING * Mentions
Which would provide the following:
RATING MENTIONS NAME SCORE
2.5 15 Bob 93.75
4.4 14 Susan 271.04
1 60 John 60
5 2 Steve 50
It is a pretty simple way to 'weight' the value of the rating.
You can go more complex, but I would think the above is sufficient, and the query is easy, so I will let you try to work that out yourself if you like the method!
You can multiply by RATING once more if you want a LOT of weight on the rating, or multiply it by a fixed amount, but raising the rating to a power is key (you could try RATING ^ 2.5, where ^ means "to the power of"; in MySQL write POWER(RATING, 2.5), because the ^ operator there is bitwise XOR).
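As a rough illustration of how the exponent changes the ordering, here is a minimal Python sketch using the four rows from the question. The names `score` and `ranking` and the `exponent` parameter are made up for this example; they are not part of the answer above.

```python
# Rows from the question: (rating, mentions, name).
rows = [
    (2.5, 15, "Bob"),
    (4.4, 14, "Susan"),
    (1.0, 60, "John"),
    (5.0, 2, "Steve"),
]

def score(rating, mentions, exponent=2.0):
    """Raise the rating to `exponent` to weight it, then scale by mentions."""
    return rating ** exponent * mentions

def ranking(rows, exponent=2.0):
    """Names ordered from best to worst score for the given exponent."""
    ordered = sorted(rows, key=lambda r: score(r[0], r[1], exponent), reverse=True)
    return [name for _, _, name in ordered]

print(ranking(rows))              # squared rating
print(ranking(rows, exponent=3))  # one more factor of RATING
```

With the squared rating the order is Susan, Bob, John, Steve. With an exponent of 3 it becomes Susan, Steve, Bob, John, which happens to match the order asked for in the question on this small sample.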

When I encounter this problem, I often take the approach of reducing the rating by one standard error. The formula for the standard error is:
standard deviation for the group / sqrt(group size)
If you had the standard deviation for each group, I would order them using:
order by rating - (case when mentions > 1 then stdev / sqrt(mentions) end) desc
This is not as "punishing" as Evan Miller's suggestion (pointed to by Juergen). This is essentially taking a confidence interval more like 60% than 95%. Admittedly, my preference is a bit empirical (based on experience). However, there is an issue with multiple comparisons and you don't need to estimate the exact confidence interval -- you just need to know the relative ordering of them.
You can calculate the standard deviation using the function stdev() (that is the SQL Server name; in MySQL the sample standard deviation is STDDEV_SAMP()).
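To make the "rating minus one standard error" idea concrete, here is a small Python sketch. It assumes you have the individual ratings behind each average, which the table in the question does not show; the sample lists and the name `lower_bound` are hypothetical.

```python
import math
import statistics

def lower_bound(ratings):
    """Mean rating minus one standard error; None with fewer than 2 ratings."""
    n = len(ratings)
    if n < 2:
        return None  # a sample standard deviation needs at least two values
    std_err = statistics.stdev(ratings) / math.sqrt(n)
    return statistics.mean(ratings) - std_err

# Hypothetical individual ratings (not taken from the question).
few_ratings = [4, 5]                     # mean 4.5 from 2 ratings
many_ratings = [4, 5, 4, 5, 4, 5, 4, 5]  # mean 4.5 from 8 ratings

print(lower_bound(few_ratings))
print(lower_bound(many_ratings))
```

Both samples have the same mean, but the one with more ratings is penalised less, so it sorts higher.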

Well, I'm not very good at statistics, but from your expected result, I believe you need to decide how important each property is relative to the other. I think you can use the equation below:
score = weight * RATING + (1 - weight) * MENTIONS
You can play around with the weight value until you get what you want. For me, 0.8 kind of makes sense:
RATING MENTIONS NAME SCORE
1 60 John 12.8
4.4 14 Susan 6.32
2.5 15 Bob 5
5 2 Steve 4.4
Note that with the raw mention counts John's 60 mentions dominate at this weight, so you will probably want to scale mentions down to a range comparable to the rating first.
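For experimenting with the weight, a short Python sketch of the weighted sum may help. The function name `weighted` is invented for this example, and the rows are the ones from the question.

```python
# Rows from the question: (rating, mentions, name).
rows = [
    (2.5, 15, "Bob"),
    (4.4, 14, "Susan"),
    (1.0, 60, "John"),
    (5.0, 2, "Steve"),
]

def weighted(rating, mentions, weight=0.8):
    """Linear blend of rating and mentions; `weight` is the share given to rating."""
    return weight * rating + (1 - weight) * mentions

for rating, mentions, name in sorted(rows, key=lambda r: weighted(r[0], r[1]), reverse=True):
    print(f"{name:6} {weighted(rating, mentions):6.2f}")
```

Because mentions are on a much larger scale than the 0-5 rating, a large mention count can outweigh the rating even at a weight of 0.8.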

Related

How to retrieve rows with particular number of different column values in SQL?

I am implementing an exam portal. I have teacher and student as users.
Teachers generate a question set for a particular subject when setting an exam. There are 4 options based on the exam's full marks (20, 50, 80 and 100 marks), and the duration is fixed by the marks selection as 30, 60, 90 and 120 minutes respectively.
I have a question table with fields for question, answer and level (Easy, Medium, Hard). An Easy question is worth 1 mark, a Medium question 2 marks and a Hard question 3 marks.
The teacher can add any number of questions to the set, and questions are fetched randomly from the database. Each student should get 6 Easy, 4 Medium and 2 Hard questions in a 20-mark set, i.e. 12 questions in total, fetched randomly according to the level criteria. Similarly, for a 50-mark set, 15 Easy, 10 Medium and 3 Hard questions have to be fetched, and so on.
Please set aside any other conditions you think should be present and help me form a MySQL query, or just give me some clues about which SQL clauses I should use.
I'd use a UNION of three selects, one per question level, so that each select fetches that level's share of the set.
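A minimal sketch of the per-level UNION idea, run against an in-memory SQLite database from Python as a stand-in for MySQL. The table layout and the 20-questions-per-level bank are assumptions for the demo. In MySQL the random ordering would be ORDER BY RAND(), and each limited SELECT would be wrapped in parentheses rather than in a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE question (id INTEGER PRIMARY KEY, level TEXT)")
# Hypothetical question bank: 20 questions per level.
conn.executemany(
    "INSERT INTO question (level) VALUES (?)",
    [(level,) for level in ("Easy", "Medium", "Hard") for _ in range(20)],
)

# One randomly ordered, limited select per level, combined with UNION ALL.
# SQLite needs each limited select wrapped in a subquery.
query = """
SELECT * FROM (SELECT id, level FROM question WHERE level = 'Easy'
               ORDER BY RANDOM() LIMIT 6)
UNION ALL
SELECT * FROM (SELECT id, level FROM question WHERE level = 'Medium'
               ORDER BY RANDOM() LIMIT 4)
UNION ALL
SELECT * FROM (SELECT id, level FROM question WHERE level = 'Hard'
               ORDER BY RANDOM() LIMIT 2)
"""
exam = conn.execute(query).fetchall()
print(len(exam), "questions drawn for the 20-mark set")
```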
From just the title, I would guess you are looking for
HAVING COUNT(col) = 7

MySQL finding what rank a number has in multiple columns

I've looked a lot on here to see if this has been asked but I couldn't find it. So I'm sorry if it has been answered before.
My question is very basic but for some reason I cannot seem to get it.
I have the following table:
ID Mike Carl Steve Josh
1 2$ 3$ 1$ 5$
2 4$ 5$ 1$ 2$
So what I need to know is where Mike ranks, from lowest to highest, among everyone in each row (ID). For ID 1 this would yield position 2 because he is the second lowest; for ID 2 it would yield position 3 because he is the third lowest. The real data has about 200,000 rows and about 20 people, but I cut it down to keep it simple.
Please let me know if you have any ideas and thank you so much in advance for all the help!
I just exported the data to Excel and used RANK on each column.
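If you would rather stay in SQL, one possible approach is to count how many of the other columns in the same row are lower than Mike's value. The sketch below runs that idea in SQLite from Python. The table name `prices` is hypothetical, and it assumes the amounts are stored as plain numbers without the $ sign. MySQL also evaluates a comparison to 0 or 1, so the same expression should carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (id INTEGER PRIMARY KEY, Mike REAL, Carl REAL, Steve REAL, Josh REAL)"
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
    [(1, 2, 3, 1, 5), (2, 4, 5, 1, 2)],
)

# Each comparison yields 1 when the other person is lower than Mike, else 0,
# so 1 + (number of lower values) is Mike's position from the bottom.
query = """
SELECT id, 1 + (Carl < Mike) + (Steve < Mike) + (Josh < Mike) AS mike_rank
FROM prices
ORDER BY id
"""
ranks = conn.execute(query).fetchall()
print(ranks)  # [(1, 2), (2, 3)]
```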

Using two columns to obtain count on another in SQL

I have a simple question that I wasn't really sure how to search for (or title!). I apologize if this has been asked a million times. For the following table, how do I generate a report that lists each number of companies a person has worked for, together with how many people have worked for that same number of companies? So, for example, this table should return:
people, companiesperperson
1, 1
2, 2
1, 3
for the following table called personalinfo:
id_number first last company
1 John Doe Intel
2 John Doe Microsoft
3 Phil Jenkins Amgen
4 Phil Jenkins Bayer
5 Phil Jenkins Sanofi
6 Josh Edwards Walgreens
7 Amy Dill URS
8 Amy Dill ARCADIS
Let me know if this is still confusing and if I can further clarify what I am looking to do.
Thanks!
This is a rough sketch of the query:
SELECT company_count AS companiesperperson, COUNT(*) AS people FROM
(SELECT COUNT(company) AS company_count, first, last FROM personalinfo GROUP BY first, last) AS a
GROUP BY company_count
To explain the query: in the subquery we group the rows by name and ask for each name together with its count of companies.
Then in the outer query we group those per-person rows by their count and ask how many people fall into each group.
Treat it as a sketch rather than tested SQL; the GROUP BY clause is really what's essential to understanding how to solve this question.
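To check the two-level grouping against the sample data, here is a small Python sketch that runs it in an in-memory SQLite database as a stand-in for MySQL. The column names come from the question; quoting "first" and "last" and the alias `company_count` are choices made for this demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE personalinfo (id_number INTEGER PRIMARY KEY, "first" TEXT, "last" TEXT, company TEXT)'
)
conn.executemany(
    "INSERT INTO personalinfo VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", "Intel"),
        (2, "John", "Doe", "Microsoft"),
        (3, "Phil", "Jenkins", "Amgen"),
        (4, "Phil", "Jenkins", "Bayer"),
        (5, "Phil", "Jenkins", "Sanofi"),
        (6, "Josh", "Edwards", "Walgreens"),
        (7, "Amy", "Dill", "URS"),
        (8, "Amy", "Dill", "ARCADIS"),
    ],
)

# Inner query: companies per person. Outer query: people per company count.
query = """
SELECT company_count AS companiesperperson, COUNT(*) AS people
FROM (SELECT "first", "last", COUNT(company) AS company_count
      FROM personalinfo
      GROUP BY "first", "last") AS a
GROUP BY company_count
ORDER BY company_count
"""
report = conn.execute(query).fetchall()
print(report)  # [(1, 1), (2, 2), (3, 1)]
```

Each tuple is (companiesperperson, people), so the sample data gives one person with 1 company, two people with 2 and one person with 3.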

Isolating unique observations and calculating the average in Stata

Currently I have a dataset that appears as follows:
mnbr firm contribution
1591 2 1
9246 6 1
812 6 1
674 6 1
And so on. The idea is that mnbr is the member number of employees who work at firm # whatever. If contribution is 1 (and I have dropped all the 0s for this purpose) said employee has contributed to a certain fund.
I additionally used codebook to determine the number of unique firms that exist. The goal is to determine the average number of contributions per firm, i.e. there was 1 contribution for firm 2, 3 contributions for firm 6 and so on. The problem I run into is accessing the number of unique values that codebook reports.
I read some documentation online for
inspect varlist
display r(N_unique)
which suggests to me that r(N_unique) would store that value, yet unfortunately this method did not work for me. So that is part 1.
Part 2 is I'd also like to create a variable that shows the contributions in each firm i.e.
mnbr firm contribution average
1591 2 1 1
9246 6 . 2/3
812 6 1 2/3
674 6 1 2/3
to show that for firm 6, 2 out of the 3 employees contributed to this fund.
Thanks in advance for the help.
To answer your comment, this works for me:
clear
set more off
input ///
mnbr firm cont
1591 2 1
9246 6 .
812 6 1
674 6 1
end
list
// problem 1
inspect firm
display r(N_unique)
// problem 2
bysort firm: egen totc = total(cont)
by firm: gen share = totc / _N
list
You have to use r(N_unique) right after inspect, before running another Stata command, or it can get lost, because later commands may overwrite the r() results. You can also save that result to a local or scalar straight away.
Problem 2 is also addressed.
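For readers who do not use Stata, the logic of problem 2 (total contributions in a firm divided by the number of employees in that firm) can be sketched in Python as follows. The variable names are invented for the example, and None plays the role of Stata's missing value ".", which total() treats as 0.

```python
from collections import defaultdict

# (mnbr, firm, contribution); None marks a missing contribution.
rows = [(1591, 2, 1), (9246, 6, None), (812, 6, 1), (674, 6, 1)]

totals = defaultdict(int)  # contributions per firm
sizes = defaultdict(int)   # employees per firm
for _, firm, contribution in rows:
    totals[firm] += contribution or 0  # missing counts as 0
    sizes[firm] += 1

share = {firm: totals[firm] / sizes[firm] for firm in sizes}
print(share)
```

For the sample data this gives a share of 1 for firm 2 and 2/3 for firm 6.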

Selecting rows if the total sum of a row is equal to X

I have a table that holds items and their "weight" and it looks like this:
items
-----
id weight
---------- ----------
1 1
2 5
3 2
4 9
5 8
6 4
7 1
8 2
What I'm trying to get is a group of rows where the sum(weight) is exactly X, while honouring the order in which they were inserted.
For example, if I were looking for X = 3, this should return:
id weight
---------- ----------
1 1
3 2
Even though the sum of ids 7 and 8 is 3 as well.
Or if I were looking for X = 7 should return
id weight
---------- ----------
2 5
3 2
Although the sum of the ids 1, 3 and 6 also sums 7.
I'm kind of lost with this problem and haven't been able to come up with a query that does anything even similar. Thinking it through, it might get extremely complex for the RDBMS to handle. Could this be done with a query? If not, what's the best way to query the database so that I get the minimum amount of data to work with?
Edit: As Twelfth says, I need to return the sum, regardless of the amount of rows it returns, so if I were to ask for X = 20, I should get:
id weight
---------- ----------
1 1
3 2
4 9
5 8
This could turn out to be very difficult in SQL. What you're attempting to do is solve a form of the knapsack problem (strictly, its subset-sum special case), which is non-trivial.
The knapsack problem is interesting from the perspective of computer science for many reasons:
The decision problem form of the knapsack problem (Can a value of at least V be achieved without exceeding the weight W?) is NP-complete, thus there is no possible algorithm both correct and fast (polynomial-time) on all cases, unless P=NP.
While the decision problem is NP-complete, the optimization problem is NP-hard: its resolution is at least as difficult as the decision problem, and there is no known polynomial algorithm which can tell, given a solution, whether it is optimal (which would mean that there is no solution with a larger V, thus solving the NP-complete decision problem).
There is a pseudo-polynomial time algorithm using dynamic programming.
There is a fully polynomial-time approximation scheme, which uses the pseudo-polynomial time algorithm as a subroutine, described below.
Many cases that arise in practice, and "random instances" from some distributions, can nonetheless be solved exactly.
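Outside the database, the pseudo-polynomial dynamic programming idea can be sketched in a few lines of Python. This is only one reading of the requirement: it assumes "honouring the insertion order" means returning the subset that is completed by the earliest possible row, which agrees with the three examples in the question. The function name and the positive-integer-weight restriction are assumptions of the sketch.

```python
def first_subset_with_sum(items, target):
    """Return ids of the subset that first reaches `target`, scanning in insertion order.

    `items` is a list of (id, weight) pairs with positive integer weights.
    Returns None if no subset sums to `target`.
    """
    reach = {0: []}  # reachable sum -> ids of one subset producing it
    for item_id, weight in items:
        # Walk sums downwards so the current item is used at most once.
        for s in range(target - weight, -1, -1):
            if s in reach and s + weight not in reach:
                reach[s + weight] = reach[s] + [item_id]
        if target in reach:
            return reach[target]
    return None

# Table from the question: (id, weight).
items = [(1, 1), (2, 5), (3, 2), (4, 9), (5, 8), (6, 4), (7, 1), (8, 2)]

print(first_subset_with_sum(items, 3))   # [1, 3]
print(first_subset_with_sum(items, 7))   # [2, 3]
print(first_subset_with_sum(items, 20))  # [1, 3, 4, 5]
```

The running time grows with the number of rows times X, so this stays practical only while X is reasonably small.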