I'm developing a new website where users can vote on some "entities".
Every vote is a number between 1 and 5, where 1 is the worst vote and 5 is the best.
On the same website I have a "Popular entities chart" that lists the most popular "entities" based on their votes.
I can't simply use the arithmetic average, because an "entity" with one vote of 5 would get the same ranking as an "entity" with 100 votes of 5.
I thought about storing, for every "entity", not only the arithmetic average but also the number of votes, and then running an SQL query that orders by number of votes and by arithmetic average. But it seems that with this approach an entity with many votes of 1 could rank as popular (when it isn't).
What algorithm could I use?
For a basic solution, try order by [average vote] desc, [vote count] desc. This way, out of two entities with the same average, the one with 100 votes will go above the one with 1 vote, but one with an average of 4.5 will never go above one with an average of 5.
Edit 1
If you want 100 vote average of 4.5 to win against 10 vote average of 5, why not count votes ignoring 1, 2 and 3, or [count of votes 4 and 5] - [count of votes 1 and 2]? This way count of positive votes would pull entities up in ranking.
Edit 2
You might want to give extra importance to recent votes. Something might have changed about an entity that changed user opinion of it. Could build another average of votes made last month and adjust final ranks based on it.
Edit 3
What about calculating a [popularityScore] column and just ordering by it?
-- sum instead of average
-- square root of sum will reduce importance of vote count a bit
select
entity,
sqrt(sum(vote - 3)) as popularityScore
from Votes
group by entity
order by popularityScore desc
-- 50 votes of 5 -> popularityScore = 10
-- 100 votes of 4 -> popularityScore = 10
-- 200 votes of 4 -> popularityScore = 14.14
-- 2000 votes of 4 -> popularityScore = 44.72
-- 2000 votes of 5 -> popularityScore = 63.25
-- 100000000 votes of 3 -> popularityScore = 0
Could calculate same score for last month and add it to this value.
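As a rough illustration of the scoring idea above, here is a small Python sketch of the same arithmetic. The function names are made up for this example, and clamping a net-negative total to 0 is an assumption (the SQL version has no real square root to return in that case), not part of the original query.

```python
import math

def popularity_score(votes):
    """Square root of the summed (vote - 3) offsets; a vote of 3 counts as neutral."""
    total = sum(v - 3 for v in votes)
    # Assumption: a net-negative total is clamped to 0 so the square root stays defined.
    return math.sqrt(max(total, 0))

def combined_score(all_votes, recent_votes):
    """All-time score plus the same score computed over recent votes only."""
    return popularity_score(all_votes) + popularity_score(recent_votes)

print(popularity_score([5] * 50))              # 50 votes of 5
print(round(popularity_score([4] * 200), 2))   # 200 votes of 4
```

How much weight the recent-votes term should get is a tuning decision; adding the two scores unweighted is only the simplest option.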
Related
I have a data set with orders of tickets. Tickets can be bought in packs of 5, or 3, as well as individually. I need to group the data using the quantity of tickets sold per order, to determine if it was a 5 pack (divisible by five), then 3 pack, or else/then individually (1 or 2 qty). So if I have a quantity of 27, I know that order consisted of five "5 packs", and 2 individual tickets.
SUM(CASE WHEN (id % 5) = 0 THEN 1 ELSE 0 END) fivepack
I have this in my query, but stringing these together for fivepack, and threepack, doesn't eliminate the starting number from the total quantity on the next operation. So a quantity of 27, would yield a result of 5 "five packs" and 9 "three packs", and then 27 "individuals".
So given a quantity, how would you first divide by a large factor, get the remainder and divide by the smaller, then finally handle the remainder?
Edit:
The sample packs provide a discount of the purchase price(not relevant to the technical issue), so the first maximum division needs to occur first. So as Gordon Linoff asked below, in the case of 27 tickets quantity, you would take the maximum number of 5 divisions first, then pass the remainder to try to divide by 3, and then return the final remainder as individuals.
The issue is passing the result of one operation in SQL to the next operation, and so on. So I can do Math1, pass Answer1 to Math2, and then pass Answer2 to Math3.
I don't fully understand why 27 would be 5 five packs and 2 individuals rather than any of the following:
27 individuals
9 3-packs
4 5-packs, 2 3-packs, 1-individual
8 3-packs and 3 individuals
and so on.
But, if you want a greedy approach, you can use the following arithmetic:
select floor(num / 5) as five_packs,
floor( (num - 5 * floor(num / 5)) / 3) as three_packs,
num - 5 * floor(num / 5) - 3 * floor( (num - 5 * floor(num / 5)) / 3) as singles
Here is a SQL Fiddle illustrating the logic.
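The same greedy arithmetic can be sketched outside SQL. This Python version (the function name is hypothetical) uses divmod to pass each remainder on to the next step, which is the "Answer1 into Math2" chaining the question asks about.

```python
def split_packs(quantity):
    """Greedy split: as many 5-packs as possible, then 3-packs, then singles."""
    five_packs, remainder = divmod(quantity, 5)
    three_packs, singles = divmod(remainder, 3)
    return five_packs, three_packs, singles

print(split_packs(27))  # 5 five-packs, 0 three-packs, 2 singles
print(split_packs(9))   # 1 five-pack, 1 three-pack, 1 single
```

Because 5 goes first, the remainder passed to the 3-pack step is always below 5, so at most one 3-pack ever comes out of this greedy order.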
I have a timestamp column with the following time entries. I'm writing the times as letters for convenience.
Person Time
1 A
2 B
3 C
4 D
5 E
5 F
5 G
6 H
The objective is to group all entries generated by the same person that have a time difference of less than 2 hours between them, and to get the count of elements in each group.
So if I had, say, 100 entries and considered the first 10 of them, I would need to check that all 10 entries are from the same person and that the time difference between successive entries is less than 2 hours. If so, the count is 10. If the time difference between the 10th and 11th entries is larger, the 11th won't be counted. And if successive entries were generated by different persons, they cannot be grouped together for the count.
So in principle it's like grouping successive entries that fit these criteria, dividing the table into sets (not breaking up the table, just grouping), and finding out which set has the max count for a person. If entries 86 to 100 fit the criteria, the count is 15, provided entries 86 to 100 were all generated by the same person. If every other set had a lower count, the output of the query should be the person who produced this max count.
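To make the grouping rule concrete, here is a hedged Python sketch of it (not an SQL answer). It assumes the rows arrive in table order as (person, timestamp) pairs; the function name and the sample timestamps are invented for illustration, since the question only gives placeholder letters.

```python
from datetime import datetime, timedelta

def longest_run(entries, max_gap=timedelta(hours=2)):
    """Return (person, count) for the longest run of successive entries by one
    person in which each gap to the previous entry is below max_gap."""
    best_person, best_count = None, 0
    run_person, run_count, prev_time = None, 0, None
    for person, ts in entries:
        if person == run_person and ts - prev_time < max_gap:
            run_count += 1                       # entry extends the current run
        else:
            run_person, run_count = person, 1    # new person or gap too large
        prev_time = ts
        if run_count > best_count:
            best_person, best_count = run_person, run_count
    return best_person, best_count

base = datetime(2024, 1, 1, 8, 0)  # hypothetical start time
hours = [0, 1, 2, 3, 4, 5, 6.5, 7]
persons = [1, 2, 3, 4, 5, 5, 5, 6]
entries = [(p, base + timedelta(hours=h)) for p, h in zip(persons, hours)]
print(longest_run(entries))
```

In SQL this kind of problem is usually approached as "gaps and islands", flagging each row that starts a new group and then counting rows per group.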
I found it hard to find a fitting title. For simplicity let's say I have the following table:
cook_id cook_rating
1 2
1 1
1 3
1 4
1 2
1 2
1 1
1 3
1 5
1 4
2 5
2 2
Now I would like to get an output of 'good' cooks. A good cook is someone for whom at least 70% of the ratings are 1, 2 or 3, rather than 4 or 5.
So in my example table, the cook with id 1 has a total of 10 ratings, 7 of which have type 1, 2 and 3. Only three have type 4 or 5. Therefore the cook with id 1 would be a 'good' cook, and the output should be the cook's id with the number of good ratings.
cook_id cook_rating
1 7
The cook with id 2, however, doesn't satisfy my condition, therefore should not be listed at all.
select cook_id, count(cook_rating) - sum(case when cook_rating = 4 OR cook_rating = 5 then 1 else 0 end) as numberOfGoodRatings from cook
where cook_rating in (1,2,3,4,5)
group by cook_id
order by numberOfGoodRatings desc
However, this doesn't take into account the fact that there might be more 4 or 5 than good ratings, resulting in negative outputs. Plus, the requirement of at least 70% is not included.
You can get this with a comparison in your HAVING clause. If you must have just the two columns in the result set, wrap the query in a sub-select: select cook_id, positive_ratings FROM (...)
SELECT
cook_id,
count(IF(cook_rating < 4, 1, NULL)) as positive_ratings,
count(*) as total_ratings
FROM cook
GROUP BY cook_id
HAVING (positive_ratings / total_ratings) >= 0.70
ORDER BY positive_ratings DESC
Edit: Note that the count is intended to count only rows where the rating is less than 4. The MySQL documentation says that count only counts non-NULL values, and in MySQL a false comparison evaluates to 0 rather than NULL, so a bare count(cook_rating < 4) would count every rated row. That is why the comparison is wrapped in IF(cook_rating < 4, 1, NULL).
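For anyone who wants to check the 70% filter against the sample data, here is a small sketch that runs an equivalent query in SQLite through Python's sqlite3 module. It is not the MySQL query itself: it uses SUM(CASE ...) in place of a conditional count and an integer comparison in HAVING, both chosen here only for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cook (cook_id INTEGER, cook_rating INTEGER)")
# Sample rows from the question: ten ratings for cook 1, two for cook 2.
ratings = [(1, r) for r in (2, 1, 3, 4, 2, 2, 1, 3, 5, 4)] + [(2, 5), (2, 2)]
conn.executemany("INSERT INTO cook VALUES (?, ?)", ratings)

query = """
SELECT cook_id,
       SUM(CASE WHEN cook_rating < 4 THEN 1 ELSE 0 END) AS positive_ratings
FROM cook
GROUP BY cook_id
HAVING 100 * SUM(CASE WHEN cook_rating < 4 THEN 1 ELSE 0 END) >= 70 * COUNT(*)
ORDER BY positive_ratings DESC
"""
good_cooks = conn.execute(query).fetchall()
print(good_cooks)  # [(1, 7)]
```

Comparing 100 * positive against 70 * total avoids any question about integer versus decimal division across database engines.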
I suggest you change your schema a little to make this kind of query trivial.
Suppose you add 5 columns to your cook table that simply count the number of ratings of each value:
nb_ratings_1 nb_ratings_2 nb_ratings_3 nb_ratings_4 nb_ratings_5
Updating such a table when a new rating is entered in DB is trivial, just as would be recomputing those numbers if having redundancy makes you nervous. And it makes all filterings and sortings fast and easy.
I appreciate that this may appear to many as a dumb question, but I cannot find a clear explanation anywhere of what effect "group by" has on a select max(...) from SQL statement.
I have the following data (there is another column image of type mediumblob which is not shown):
id title test_id
1 bomb 0
2 Soft watch 2
3 Dali 1
4 Narciss 1
5 The Woman In Green 0
6 A summer in Vetheuil 0
7 Artist's Garden 2
8 Beech Forest 2
9 Claude Monet 0
I know if I perform
select max(id) from images
where image is not null;
I get the max value of id i.e.:
max(id)
9
However can someone please explain what is happening when I perform
select max(id), title, test_id
from images
where image is not null
group by id;
I find that the max(id) serves no useful purpose (results shown below)?
max(id) title test_id
1 bomb 0
2 Soft watch 2
3 Dali 1
4 Narciss 1
5 The Woman In Green 0
6 A summer in Vetheuil 0
7 Artist's Garden 2
8 Beech Forest 2
9 Claude Monet 0
In the case of using MAX() the GROUP BY clause essentially tells the query engine how to group the items from which to determine a maximum. In your first example you were selecting only a single column, so there was no need for grouping. But in your second example you had multiple columns. So you need to tell the query engine how to determine which ones are going to be compared to find a maximum.
You told it to group by the id column. Which means that it's going to compare records which have the same id and give you the maximum one for each unique id. Since every record has a different id, you essentially didn't do anything with that clause.
It grouped all records with an id of 1 (which was a single record), and returned the record with the maximum id from that group (which was that record). It did the same for 2, 3, etc.
In the case of the three columns shown here, the only place where it would make sense to group your records would be on the test_id column. Something like this:
SELECT MAX(id), title, test_id
FROM images
WHERE image IS NOT null
GROUP BY test_id
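This would group them by test_id, so the result holds one row per group, with the maximum ids 9 (for test_id 0), 4 (for test_id 1), and 8 (for test_id 2). By splitting the records into those three groups based on the three unique values in the test_id column, it can effectively find a "maximum" id within each group. Note that title is neither aggregated nor listed in the GROUP BY, so MySQL either rejects the query (when ONLY_FULL_GROUP_BY is enabled) or returns a title from an arbitrary row of each group, not necessarily the row holding the maximum id.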
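The grouping can be tried out on the question's sample rows. The sketch below loads them into an in-memory SQLite database via Python's sqlite3 module (a stand-in for MySQL; the b"x" image value is a dummy so that the IS NOT NULL filter passes) and selects only the grouped column and the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER, title TEXT, test_id INTEGER, image BLOB)")
rows = [
    (1, "bomb", 0), (2, "Soft watch", 2), (3, "Dali", 1),
    (4, "Narciss", 1), (5, "The Woman In Green", 0),
    (6, "A summer in Vetheuil", 0), (7, "Artist's Garden", 2),
    (8, "Beech Forest", 2), (9, "Claude Monet", 0),
]
# b"x" is a placeholder image so every row survives the IS NOT NULL filter.
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    [(i, title, test_id, b"x") for i, title, test_id in rows],
)

max_per_group = conn.execute(
    "SELECT test_id, MAX(id) FROM images "
    "WHERE image IS NOT NULL GROUP BY test_id ORDER BY test_id"
).fetchall()
for test_id, max_id in max_per_group:
    print(test_id, max_id)  # one line per distinct test_id
```

Leaving title out of the select list sidesteps the question of which row's title a grouped query should return.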
Yes, in your example it serves no useful purpose.
You're grouping by ID then finding the maximum ID. But that doesn't make sense since there's only one of each ID. Normally MAX() is used on quantities, like prices or item counts or such like.
Group by is not used for this kind of query.
It is used for queries like this:
OId OrderDate OrderPrice Customer
1 2008/11/12 1000 Hansen
2 2008/10/23 1600 Nilsen
3 2008/09/02 700 Hansen
4 2008/09/03 300 Hansen
5 2008/08/30 2000 Jensen
6 2008/10/04 100 Nilsen
Now if you want to get the total order price for each of these customers, you would use group by:
SELECT Customer,SUM(OrderPrice) FROM Orders
GROUP BY Customer
customer SUM(OrderPrice)
Hansen 2000
Nilsen 1700
Jensen 2000
In your case id is unique, so grouping by id does not make any sense.
I need help with a MySQL query. We have a database (~10K rows) which I have simplified down to this problem.
We have 7 truck drivers who visit 3 out of a possible 9 locations, daily. Each day they visit exactly 3 different locations and each day they can visit different locations than the previous day. Here are representative tables:
Table: Drivers
id name
10 Abe
11 Bob
12 Cal
13 Deb
14 Eve
15 Fab
16 Guy
Table: Locations
id day address driver.id
1 1 Oak 10
2 1 Elm 10
3 1 4th 10
4 1 Oak 16
5 1 4th 16
6 1 Toy 16
7 1 Toy 11
8 1 5th 11
9 1 Law 11
10 2 Oak 11
11 2 4th 11
12 2 Toy 11
.........
We have data for a full year and we need to find out how many times each "route" is visited over a year, sorted from most to least.
From my high school math, I believe there are 9!/(6!3!) route combinations, or 84 in total. I want to do something like:
Get count of routes where route addresses = 'Oak' and 'Elm' and '4th'
then run again
where route addresses = 'Oak' and 'Elm' and '5th'
then again and again, etc. Then sort the route counts, descending. But I don't want to do it 84 times. Is there a way to do this?
I'd be looking at GROUP_CONCAT
SELECT t.day
, t.driver
, GROUP_CONCAT(t.address ORDER BY t.address)
FROM mytable t
GROUP
BY t.day
, t.driver
What's not clear here is whether there's an order to the stops on the route. Does the sequence make a difference, and how do we tell what the sequence is? To ask that a different way, consider these two routes:
('Oak','Elm','4th') and ('Elm','4th','Oak')
Are these equivalent (because it's the same set of stops) or are they different (because they are in a different sequence)?
If sequence of stops on the route distinguishes it from other routes with the same stops (in a different order), then replace the ORDER BY t.address with ORDER BY t.id or whatever expression gives the sequence of the stops.
Some caveats with GROUP_CONCAT: the maximum length is limited by the setting of group_concat_max_len and max_allowed_packet variables. Also, the comma used as the separator... if we combine strings that contain commas, then in our result, we can't reliably distinguish between 'a,b'+'c' and 'a'+'b,c'
We can use that query as an inline view, and get a count of the number of rows with identical routes:
SELECT c.route
, COUNT(*) AS cnt
FROM ( SELECT t.day
, t.driver
, GROUP_CONCAT(t.address ORDER BY t.address) AS route
FROM mytable t
GROUP
BY t.day
, t.driver
) c
GROUP
BY c.route
ORDER
BY cnt DESC
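The same counting logic, sketched in Python for comparison. It assumes the order of stops does not matter, so each (day, driver) group is reduced to a sorted tuple of addresses. The visits list copies the sample rows from the question, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# (day, address, driver_id) rows taken from the sample Locations table.
visits = [
    (1, "Oak", 10), (1, "Elm", 10), (1, "4th", 10),
    (1, "Oak", 16), (1, "4th", 16), (1, "Toy", 16),
    (1, "Toy", 11), (1, "5th", 11), (1, "Law", 11),
    (2, "Oak", 11), (2, "4th", 11), (2, "Toy", 11),
]

def count_routes(rows):
    """Count how often each set of stops occurs per (day, driver) pair."""
    stops = defaultdict(list)
    for day, address, driver_id in rows:
        stops[(day, driver_id)].append(address)
    # Sorting makes routes with the same stops compare equal regardless of order.
    return Counter(tuple(sorted(addresses)) for addresses in stops.values())

route_counts = count_routes(visits)
for route, cnt in route_counts.most_common():
    print(route, cnt)
```

Using tuples as keys also avoids the separator ambiguity mentioned for GROUP_CONCAT, since the addresses are never joined into a single string.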