Two query methods give different results - mysql

I am two days into MySQL. While working on a database of movies, I decided to try to answer a specific question: which movies from a list of 1000 made a negative profit but had an IMDB score between 8 and 9?
Ideally, I wanted to return only the movies that meet those criteria, but at first I did not know about the WHERE clause. So I used CASE, so that I could simply mark those movies with 'Quality Anomaly'.
Then, when I learned about WHERE, I implemented it. However, the results were different. The query using CASE marked only 6 movies, while the query using WHERE returned 51 movies (rows). I added the appropriate columns to check the results, and they all checked out, but something is odd: when I run SELECT * to return all movies, many of the 51 movies returned by the WHERE query are missing from the complete list of 1000.
Here is the code:
/**
This query is supposed to return all movie titles while indicating
the profit from the movie, or 'Quality Anomaly' if profit is
negative while imdb score is between 8 and 9. It only indicates
'Quality Anomaly' for 6 movies.
**/
select
movie_title,
imdb_score,
case
when (gross < budget) and (imdb_score between 8 and 9) then 'Quality Anomaly'
else gross - budget
end as profit
from movies;

/**
The next query is supposed to only list the movies that meet
the 'Quality Anomaly' condition described above, but instead
of returning only 6 movies, it returns 51.
**/
select
movie_title,
(gross - budget) as profit,
imdb_score
from movies
where (gross < budget) and (imdb_score between 8 and 9);
Link to output data via github.
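One way to sanity-check the two approaches is to count the flagged rows both ways on a small table. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, and the rows are made up for illustration (they are not the asker's data). On a single table, the same condition should flag the same number of rows in a CASE and in a WHERE, so a mismatch such as 6 versus 51 suggests the two queries were not looking at the same set of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (movie_title TEXT, gross INTEGER, budget INTEGER, imdb_score REAL)"
)
# Hypothetical rows, chosen to cover the edge cases of the condition.
rows = [
    ("A", 50, 100, 8.5),    # loss with a high score: flagged
    ("B", 200, 100, 8.2),   # profit: not flagged
    ("C", 10, 100, 6.0),    # loss but low score: not flagged
    ("D", None, 100, 8.9),  # unknown gross: the comparison is NULL, not flagged
    ("E", 90, 100, 9.0),    # loss, score on the boundary: flagged (BETWEEN is inclusive)
]
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", rows)

# Count the rows that the CASE expression would mark.
case_count = conn.execute("""
    SELECT SUM(CASE WHEN gross < budget AND imdb_score BETWEEN 8 AND 9
                    THEN 1 ELSE 0 END)
    FROM movies
""").fetchone()[0]

# Count the rows that the WHERE clause would return.
where_count = conn.execute("""
    SELECT COUNT(*)
    FROM movies
    WHERE gross < budget AND imdb_score BETWEEN 8 AND 9
""").fetchone()[0]

print(case_count, where_count)
```

Running the same pair of counting queries against the real movies table would show whether the two conditions really disagree or whether the difference comes from how the results were displayed or compared.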

Related

MySQL query for an MLB database

There are 5 tables: mlb_batting, mlb_manager, mlb_master, mlb_pitching, mlb_team.
Find the top 10 (highest) “strike outs per walk” statistic for all pitchers with at least 1 walk who played in at least 25 games. You should display their first name, last name, and K/BB statistic. K/BB is computed by dividing the number of strike outs by the number of walks (“base on balls”). You will need to use “limit” in MySQL (not talked about in class or notes – you will have to search how to do it). I would like this query done 2 different ways. One that only looks at the 25 games and 1 walk on a per-stint basis. That is, if they played for two different teams (two different stints), you would count those separately. The other query should combine all the stints they had. That is, if they played for two different teams, you would add up their games and walks.
My solution is:
SELECT NAME_FIRST, NAME_LAST, SUM(strikeouts) / SUM(walks) AS KS_PER_BB
FROM mlb_master
JOIN mlb_pitching
ON mlb_master.player_id = mlb_pitching.player_id
WHERE walks >= 1 AND games >= 25
GROUP BY name_first, name_last, mlb_pitching.stint
ORDER BY KS_PER_BB DESC
LIMIT 10;
I am wondering if this solution is better for the first way my professor wants it done or the second way, if any.
This solution is appropriate for the first query because by having GROUP BY stint, each stint is considered different for each player.
For the second way, could I remove the stint column from the GROUP BY clause so that it groups the records for a particular player together, regardless of the different stints they played for?
Would this result in the sum of all their walks and strikeouts from all their stints being used to calculate the KS_PER_BB statistic, giving you the combined total for each player?
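To make the difference between the two readings concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The table and column names come from the question, but the exact schema and all rows are assumptions. One point it tries to show: for the combined version, dropping stint from GROUP BY is probably not enough on its own, because WHERE still filters each stint row before the sums are taken. Moving the thresholds into HAVING applies them to the per-player totals instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema; the real tables have more columns.
    CREATE TABLE mlb_master (player_id TEXT PRIMARY KEY, name_first TEXT, name_last TEXT);
    CREATE TABLE mlb_pitching (player_id TEXT, stint INTEGER, games INTEGER,
                               walks INTEGER, strikeouts INTEGER);
    INSERT INTO mlb_master VALUES
        ('p1', 'Ann', 'Lee'), ('p2', 'Bob', 'Ray'), ('p3', 'Cy', 'Fox');
    INSERT INTO mlb_pitching VALUES
        ('p1', 1, 20, 5, 20),   -- Ann, stint 1: fewer than 25 games
        ('p1', 2, 10, 5, 30),   -- Ann, stint 2: fewer than 25 games
        ('p2', 1, 30, 10, 30),  -- Bob: qualifies on a single stint
        ('p3', 1, 40, 0, 15);   -- Cy: no walks, never qualifies
""")

# Per stint: thresholds are checked on each stint row (WHERE).
# The 1.0 * forces real division in SQLite; MySQL's / already does that.
per_stint = conn.execute("""
    SELECT m.name_first, m.name_last,
           1.0 * SUM(p.strikeouts) / SUM(p.walks) AS ks_per_bb
    FROM mlb_master m
    JOIN mlb_pitching p ON m.player_id = p.player_id
    WHERE p.walks >= 1 AND p.games >= 25
    GROUP BY m.player_id, m.name_first, m.name_last, p.stint
    ORDER BY ks_per_bb DESC
    LIMIT 10
""").fetchall()

# Combined: thresholds are checked on the per-player totals (HAVING).
combined = conn.execute("""
    SELECT m.name_first, m.name_last,
           1.0 * SUM(p.strikeouts) / SUM(p.walks) AS ks_per_bb
    FROM mlb_master m
    JOIN mlb_pitching p ON m.player_id = p.player_id
    GROUP BY m.player_id, m.name_first, m.name_last
    HAVING SUM(p.walks) >= 1 AND SUM(p.games) >= 25
    ORDER BY ks_per_bb DESC
    LIMIT 10
""").fetchall()

print(per_stint)
print(combined)
```

In this toy data Ann only reaches 25 games when her two stints are added together, so she appears in the combined result but not in the per-stint one. Grouping by player_id rather than by name alone also keeps two players with the same name apart.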

count similar attribute value

I am practicing on a data set with the table structure below. I have written several queries successfully so far. However, two queries are giving me trouble, and I was wondering if you could point out what I am doing wrong.
1: Need to extract top 10 AuthorAccountId that created the highest number of changes on the project 'abc'.
E.g. it returns the top 10 AuthorAccountId that created the highest number of changes on the project 'abc'. The query must return two columns: the authorId and the number of changes made by each author.
The following is the query that I have written, but it does not give me the desired result.
SELECT ch_authoraccountid,count(ch_project)
FROM t_change
WHERE ch_project LIKE 'abc'
ORDER BY ch_authoraccountId DESC
LIMIT 10
2: Return the names of authors who did not submit any change during the year 2017, ( this one is going to be sub query on t_change).
e.g Expected result should return the name of authors who did not bring any changes in the year 2017.
Following is the query
SELECT p_name
FROM t_people
WHERE p_accountid IN (SELECT ch_createdTime
FROM t_change
WHERE ch_createdTime != '2016-01-01')
Ref: Yang, R. G. Kula, N. Yoshida and H. Iida, "Mining the Modern Code Review Repositories: A Dataset of People, Process and Product," 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), Austin, TX, 2016, pp. 460-463.
https://github.com/kin-y/miningReviewRepo/wiki/Database-Schema
For the first query, try this :
SELECT ch_authoraccountid, COUNT(ch_project)
FROM t_change
WHERE ch_project = 'abc'
GROUP BY ch_authoraccountid
ORDER BY COUNT(ch_project) DESC
LIMIT 10
It will count the number of changes in the 'abc' project per authoraccountid.
For the second :
SELECT p_name
FROM t_people
WHERE p_accountid NOT IN (SELECT ch_authorAccountId
FROM t_change
WHERE ch_createdTime >= '2017-01-01' AND ch_createdTime < '2018-01-01')
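Both answers can be tried out on a toy copy of the tables. The sketch below uses Python's sqlite3 module in place of MySQL; the column names follow the queries above, while the rows and the choice to store ch_createdTime as ISO-formatted text are assumptions made for illustration. The 2017 filter is written as a half-open range so that a change late on 31 December still counts, and NULL author ids are excluded from the subquery because a single NULL makes NOT IN return no rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal versions of the two tables.
    CREATE TABLE t_people (p_accountId INTEGER PRIMARY KEY, p_name TEXT);
    CREATE TABLE t_change (ch_authorAccountId INTEGER, ch_project TEXT, ch_createdTime TEXT);
    INSERT INTO t_people VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO t_change VALUES
        (1, 'abc', '2017-03-01 10:00:00'),
        (1, 'abc', '2016-05-01 09:00:00'),
        (2, 'abc', '2016-07-01 12:00:00'),
        (2, 'xyz', '2017-12-31 23:30:00'),
        (1, 'xyz', '2018-01-02 08:00:00');
""")

# Query 1: authors ranked by number of changes on project 'abc'.
top_authors = conn.execute("""
    SELECT ch_authorAccountId, COUNT(*) AS num_changes
    FROM t_change
    WHERE ch_project = 'abc'
    GROUP BY ch_authorAccountId
    ORDER BY num_changes DESC
    LIMIT 10
""").fetchall()

# Query 2: people with no change created during 2017.
inactive_2017 = [row[0] for row in conn.execute("""
    SELECT p_name
    FROM t_people
    WHERE p_accountId NOT IN (
        SELECT ch_authorAccountId
        FROM t_change
        WHERE ch_createdTime >= '2017-01-01'
          AND ch_createdTime <  '2018-01-01'
          AND ch_authorAccountId IS NOT NULL
    )
    ORDER BY p_name
""")]

print(top_authors)
print(inactive_2017)
```

Here bob's only 2017 change is at 23:30 on 31 December, which is why the upper bound is written as "before 2018-01-01" rather than "up to 2017-12-31".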

Relational Database Logic

I'm fairly new to php / mysql programming and I'm having a hard time figuring out the logic for a relational database that I'm trying to build. Here's the problem:
I have different leaders who will be in charge of a store anytime between 9am and 9pm.
A customer who has visited the store can rate their experience on a scale of 1 to 5.
I'm building a site that will allow me to store the shifts that a leader worked as seen below.
When I hit submit, the site would take the data leaderName:"George", shiftTimeArray: 11am, 1pm, 6pm (from the example in the picture) and the shiftDate and send them to an SQL database.
Later, I want to be able to get the average score for a person by sending a query to mysql, retrieving all of the scores that that leader received and averaging them together. I know the code to build the forms and to perform the search. However, I'm having a hard time coming up with the logic for the tables that will relate the data. Currently, I have a mysql table called responses that contains the following fields,
leader_id
shift_date // contains the date that the leader worked
shift_time // contains the time that the leader worked
visit_date // contains the date that the survey/score was given
visit_time // contains the time that the survey/score was given
score // contains the actual score of the survey (1-5)
I enter the shifts that the leader works at the beginning of the week and then enter the survey scores in as they come in during the week.
So Here's the Question: What mysql tables and fields should I create to relate this data so that I can query a leader's name and get the average score from all of their surveys?
You want tables like:
Leader (leader_id, name, etc)
Shift (leader_id, shift_date, shift_time)
SurveyResult (visit_date, visit_time, score)
Note: omitted the surrogate primary keys for Shift and SurveyResult that I would probably include.
To query, you join shifts and surveys, group on leader and take the average, then join that back to Leader for the name.
The query might be something like this (but I haven't actually built it in MySQL to verify the syntax):
SELECT name
,AverageScore
FROM Leader a
INNER JOIN (
SELECT leader_id
, AVG(score) AverageScore
FROM Shift
INNER JOIN
SurveyResult ON shift_date = visit_date
AND shift_time = visit_time -- what this really needs to be depends on how you are recording time
GROUP BY leader_id
) b ON a.leader_id = b.leader_id
I would do the following structure:
leaders
id
name
leaders_timetable (can be multiple per leader)
id,
leader_id
shift_datetime (I assume it stores the date and hour here; minutes and seconds are always 0)
survey_scores
id,
visit_datetime
score
SELECT l.id, l.name, AVG(s.score) FROM leaders l
INNER JOIN leaders_timetable lt ON lt.leader_id = l.id
INNER JOIN survey_scores s ON lt.shift_datetime = DATE_FORMAT(s.visit_datetime, '%Y-%m-%d %H:00:00')
GROUP BY l.id
DATE_FORMAT here truncates the minutes and seconds of visit_datetime so that it can be matched against shift_datetime. This is a MySQL function, so if you use a different database you'll need a different function.
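As a rough check of this structure, the sketch below builds the three tables with Python's sqlite3 module and runs the same join. SQLite has no DATE_FORMAT, so strftime plays that role here; the sample leaders, shifts and scores are invented, and storing datetimes as text is an assumption of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leaders (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE leaders_timetable (
        id INTEGER PRIMARY KEY,
        leader_id INTEGER REFERENCES leaders(id),
        shift_datetime TEXT            -- always on the hour, e.g. '2024-01-01 11:00:00'
    );
    CREATE TABLE survey_scores (
        id INTEGER PRIMARY KEY,
        visit_datetime TEXT,
        score INTEGER                  -- 1 to 5
    );

    INSERT INTO leaders VALUES (1, 'George'), (2, 'Mary');
    INSERT INTO leaders_timetable (leader_id, shift_datetime) VALUES
        (1, '2024-01-01 11:00:00'),
        (1, '2024-01-01 13:00:00'),
        (2, '2024-01-01 12:00:00');
    INSERT INTO survey_scores (visit_datetime, score) VALUES
        ('2024-01-01 11:15:00', 5),    -- during George's 11am shift
        ('2024-01-01 13:45:10', 2),    -- during George's 1pm shift
        ('2024-01-01 12:30:00', 4),    -- during Mary's 12pm shift
        ('2024-01-01 18:05:00', 1);    -- nobody scheduled: dropped by the join
""")

# Truncate each visit to the start of its hour, then match it to a shift.
averages = conn.execute("""
    SELECT l.name, AVG(s.score) AS average_score
    FROM leaders l
    INNER JOIN leaders_timetable lt ON lt.leader_id = l.id
    INNER JOIN survey_scores s
        ON lt.shift_datetime = strftime('%Y-%m-%d %H:00:00', s.visit_datetime)
    GROUP BY l.id, l.name
    ORDER BY l.id
""").fetchall()

print(averages)
```

Surveys that fall in an hour with no scheduled leader are silently left out by the inner joins, which may or may not be what you want.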
Say you have a 'leader' who has 5 survey rows with scores 1, 2, 3, 4 and 5.
If you select all surveys from this leader, sum the survey scores and divide the sum by 5 (the total number of surveys that this leader has), you will have the average, in this case 3.
(1 + 2 + 3 + 4 + 5) / 5 = 3
You wouldn't need to create any more tables or fields; you have what you need.

MySQL query for items where average price is less than X?

I'm stumped with how to do the following purely in MySQL, and I've resorted to taking my result set and manipulating it in ruby afterwards, which doesn't seem ideal.
Here's the question. With a dataset of 'items' like:
id  state_id  price  issue_date  listed
1   5         450    2011        1
1   5         455    2011        1
1   5         490    2011        1
1   5         510    2012        0
1   5         525    2012        1
...
I'm trying to get something like:
SELECT * FROM items
WHERE ([some conditions], e.g. issue_date >= 2011 and listed=1)
AND state_id = 5
GROUP BY id
HAVING AVG(price) <= 500
ORDER BY price DESC
LIMIT 25
Essentially I want to grab a "group" of items whose average price fall under a certain threshold. I know that my above example "group by" and "having" are not correct since it's just going to give the AVG(price) of that one item, which doesn't really make sense. I'm just trying to illustrate my desired result.
The important thing here is I want all of the individual items in my result set, I don't just want to see one row with the average price, total, etc.
Currently I'm just doing the above query without the HAVING AVG(price) and adding up the individual items one-by-one (in ruby) until I reach the desired average. It would be really great if I could figure out how to do this in SQL. Using subqueries or something clever like joining the table onto itself are certainly acceptable solutions if they work well! Thanks!
UPDATE: In response to Tudor's answer below, here are some clarifications. There is always going to be a target quantity in addition to the target average. And we would always sort the results by price low to high, and by date.
So if we did have 10 items that were all priced at $5 and we wanted to find 5 items with an average < $6, we'd simply return the first 5 items. We wouldn't return the first one only, and we wouldn't return the first 3 grouped with the last 2. That's essentially how my code in ruby is working right now.
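With a fixed target quantity, the procedure described in this update can be expressed as "take the N cheapest qualifying rows, and keep them only if their average is under the target". The following is a hedged sketch of that idea using Python's sqlite3 module instead of MySQL; the rows mirror the sample data above, the cheapest_group helper and its parameters are hypothetical names, and the common table expression would need MySQL 8.0 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER, state_id INTEGER, price INTEGER,"
    " issue_date INTEGER, listed INTEGER)"
)
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)", [
    (1, 5, 450, 2011, 1),
    (1, 5, 455, 2011, 1),
    (1, 5, 490, 2011, 1),
    (1, 5, 510, 2012, 0),  # not listed, so never a candidate
    (1, 5, 525, 2012, 1),
])

def cheapest_group(quantity, max_avg):
    """Return the prices of the `quantity` cheapest qualifying items,
    or an empty list if there are too few items or their average
    exceeds `max_avg`."""
    sql = """
        WITH cheapest AS (
            SELECT id, price, issue_date
            FROM items
            WHERE issue_date >= 2011 AND listed = 1 AND state_id = 5
            ORDER BY price ASC, issue_date ASC
            LIMIT :n
        )
        SELECT price
        FROM cheapest
        WHERE (SELECT COUNT(*) FROM cheapest) = :n
          AND (SELECT AVG(price) FROM cheapest) <= :max_avg
        ORDER BY price ASC
    """
    return [row[0] for row in conn.execute(sql, {"n": quantity, "max_avg": max_avg})]

print(cheapest_group(3, 470))
print(cheapest_group(4, 470))
```

Because the rows are taken in ascending price order, the N cheapest rows have the lowest possible average of any N rows, so if they miss the target no other group of that size can meet it.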
I would do almost an inverse of what Jasper provided... Start your query with your criteria to explicitly limit the few items that MAY qualify instead of getting all items and running a sub-select on each entry. Could pose as a larger performance hit... could be wrong, but here's my offering..
select
i2.*
from
( SELECT i.id
FROM items i
WHERE
i.issue_date > 2011
AND i.listed = 1
AND i.state_id = 5
GROUP BY
i.id
HAVING
AVG( i.price) <= 500 ) PreQualify
JOIN items i2
on PreQualify.id = i2.id
AND i2.issue_date > 2011
AND i2.listed = 1
AND i2.state_id = 5
order by
i2.price desc
limit
25
I'm not sure about the ORDER BY, especially if you wanted the rows grouped by item. In addition, I would make sure there is an index on (state_id, listed, id, issue_date).
CLARIFICATION per comments
I think I AM correct on it. Don't confuse the HAVING clause with WHERE. WHERE decides, row by row, whether to include a row based on certain conditions. HAVING is checked only after all the WHERE conditions and the grouping are done: at that point each group is a potential result, and it is kept only if it also satisfies the HAVING condition; otherwise it is thrown out. Try the following INNER query on its own. Run it once WITHOUT the HAVING clause, then again WITH the HAVING clause.
SELECT i.id, avg( i.price )
FROM items i
WHERE i.issue_date > 2011
AND i.listed = 1
AND i.state_id = 5
GROUP BY
i.id
HAVING
AVG( i.price) <= 500
As you get more into writing queries, try the parts individually to see what you are getting vs what you are thinking... You'll find how / why certain things work. In addition, you are now talking in your updated question about getting multiple IDs and prices at apparent low and high range... yet you are also applying a limit. If you had 20 items, and each had 10 qualifying records, your limit of 25 would show all of the first item and 5 into the second... which is NOT what I think you want... you may want 25 of each qualified "id". That would wrap this query into yet another level...
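To see the pre-qualify-then-join idea run end to end, here is a small sketch with Python's sqlite3 module standing in for MySQL. The rows are invented, and the sketch uses issue_date >= 2011 as in the question rather than > 2011. The inner query picks the ids whose filtered average is at most 500, and the outer join brings back every individual row for those ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER, state_id INTEGER, price INTEGER,"
    " issue_date INTEGER, listed INTEGER)"
)
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)", [
    (1, 5, 450, 2011, 1),
    (1, 5, 455, 2011, 1),
    (1, 5, 490, 2011, 1),
    (1, 5, 510, 2012, 0),  # unlisted: ignored by both the filter and the average
    (1, 5, 525, 2012, 1),
    (2, 5, 600, 2011, 1),  # id 2 averages 650, above the threshold
    (2, 5, 700, 2012, 1),
    (3, 6, 300, 2011, 1),  # wrong state
])

rows = conn.execute("""
    SELECT i2.id, i2.price
    FROM (
        -- Groups that survive both WHERE (per row) and HAVING (per group).
        SELECT i.id
        FROM items i
        WHERE i.issue_date >= 2011 AND i.listed = 1 AND i.state_id = 5
        GROUP BY i.id
        HAVING AVG(i.price) <= 500
    ) AS PreQualify
    JOIN items i2
      ON PreQualify.id = i2.id
     AND i2.issue_date >= 2011
     AND i2.listed = 1
     AND i2.state_id = 5
    ORDER BY i2.price DESC
    LIMIT 25
""").fetchall()

print(rows)
```

Item 1 qualifies because its four listed rows average 480, even though one of those rows costs 525. That is the behaviour HAVING gives you: the threshold applies to the group, not to each row.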
What MySQL does makes perfect sense. What you want to do does not:
if you have, let's say, 4 items, each with a price of 5, and you put HAVING AVG(price) <= 7, what you are saying is that the query should return ALL the possible combinations, like:
{1} - since item with id 1, can be a group by itself
{1,2}
{1,3}
{1,4}
{1,2,3}
{1,2,4}
...
and so on?
Your algorithm for computing the average in Ruby is also not valid. If you have items with values 5, 1, 7, 10 and look for an average value of less than 7, the element with value 10 can only be returned in a group together with the element with value 1. But with your algorithm (if I understood it correctly), the element with value 1 is returned in the first group.
Update
What you want is something like the Knapsack problem and your approach is using some kind of Greedy Algorithm to solve it. I don't think there are straight, easy and correct ways to implement that in SQL.
After a google search, I found this article which tries to solve the knapsack problem with AI written in SQL.
By considering your item price as a weight, having the number of items and the desired average, you could compute the maximum value that can be entered in the 'knapsack' by multiplying desired_cost with number_of_items
I'm not entirely sure from your question, but I think this is a solution to your problem:
SELECT * FROM items
WHERE (some "conditions", e.g. issue_date > 2011 and listed=1)
AND state_id = 5
AND id IN (SELECT id
FROM items
GROUP BY id
HAVING AVG(price) <= 500)
ORDER BY price DESC
LIMIT 25
note: This is off the top of my head and I haven't done complex SQL in a while, so it might be wrong. I think this or something like it should work, though.

How would I do this in MySQL?

Let's say I have a database of widgets. I am showing a list of the top ten groupings of each widget, separated by category.
So let's say I want to show a list of all widgets in category A, but I want to sort them based on the total number of widgets in that category and only show the top 10 groupings.
So, my list might look something like this.
Top groupings in Category A
100 Widgets made by company 1 in 1990.
90 Widgets made by company 1 in 1993.
70 Widgets made by company 3 in 1993.
etc...(for 10 groupings)
This part is easy, but now let's say I want a certain grouping to ALWAYS show up in the listings even if it doesn't actually make the top ten.
Let's say I ALWAYS want to show the number of widgets made by company 1 in 2009, but I want this grouping to be shown somewhere in my list randomly (not first or last).
So the end list should look something like
Top groupings in Category A
100 Widgets made by company 1 in 1990.
90 Widgets made by company 1 in 1993.
30 Widgets made by company 1 in 2009.
70 Widgets made by company 3 in 1993.
How would I accomplish this in MySQL?
Thanks
Edit:
Currently, my query looks like this
SELECT
year,
manufacturer,
MAX(price) AS price,
image_url,
COUNT(id) AS total
FROM
widgets
WHERE
category_id = A
AND
year <> ''
AND
manufacturer <> ''
GROUP BY
category_id,
manufacturer,
year
ORDER BY
total DESC,
price ASC
LIMIT
10;
That's without the mandatory grouping in there.
The placement doesn't necessarily have to be random; it just shouldn't be at either extreme end. And the list should be 10 groupings including the mandatory listing, so 9 + 1.
I would use a UNION query: your current query UNION the query for 2009, then handle the sorting in the presentation layer.
You can write 2 separate queries (one for all companies and another just for company 1) and then use UNION to join them together. Finally, add ORDER BY RAND().
It will look like this:
SELECT * FROM
(
(SELECT company_id, company_name, year, count(*) as num_widgets
....
LIMIT 10)
UNION DISTINCT
(SELECT company_id, company_name, year, count(*) as num_widgets
...
WHERE company_id =1
...
LIMIT 10)
)x
ORDER BY RAND();
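A possible way to combine the two suggestions (UNION in SQL, placement in the presentation layer) is sketched below with Python's sqlite3 module in place of MySQL. The widgets rows are generated for illustration, and the pinned flag column is an assumption of this sketch rather than part of the asker's schema. The top 9 groupings exclude the mandatory one so that the combined list always has 9 + 1 entries, and the mandatory grouping is then inserted at a random position that is neither first nor last.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgets (id INTEGER PRIMARY KEY, category_id TEXT,"
    " manufacturer TEXT, year INTEGER)"
)
# Ten ordinary groupings with 20, 19, ..., 11 widgets, plus the mandatory
# grouping (company 1, 2009) with only 3 widgets.
groupings = [(f"company {i + 2}", 1990 + i, 20 - i) for i in range(10)]
groupings.append(("company 1", 2009, 3))
for manufacturer, year, count in groupings:
    conn.executemany(
        "INSERT INTO widgets (category_id, manufacturer, year) VALUES ('A', ?, ?)",
        [(manufacturer, year)] * count,
    )

sql = """
    SELECT manufacturer, year, total, 0 AS pinned FROM (
        SELECT manufacturer, year, COUNT(*) AS total
        FROM widgets
        WHERE category_id = :cat AND NOT (manufacturer = :m AND year = :y)
        GROUP BY manufacturer, year
        ORDER BY total DESC
        LIMIT 9
    )
    UNION ALL
    SELECT manufacturer, year, total, 1 AS pinned FROM (
        SELECT manufacturer, year, COUNT(*) AS total
        FROM widgets
        WHERE category_id = :cat AND manufacturer = :m AND year = :y
        GROUP BY manufacturer, year
    )
"""
rows = conn.execute(sql, {"cat": "A", "m": "company 1", "y": 2009}).fetchall()

# Presentation layer: sort the regular groupings, then slot the mandatory
# one in at a random interior position (never first, never last).
listing = sorted((r[:3] for r in rows if r[3] == 0), key=lambda r: -r[2])
for pinned_row in (r[:3] for r in rows if r[3] == 1):
    listing.insert(random.randint(1, len(listing) - 1), pinned_row)

for manufacturer, year, total in listing:
    print(f"{total} widgets made by {manufacturer} in {year}")
```

Doing the placement outside SQL avoids ORDER BY RAND(), which would shuffle the whole list instead of keeping the other nine groupings in order.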
You could add a boolean field that you set to true for company 1 in 2009 and include it in the WHERE clause. Something like
select * from companies where `group` = 'some group' or included = true order by included desc, widgets_made desc limit 10
For the random part, you would use that as a subquery and add a column that holds a random number from 1 to 10 if the field you added is true, and the row number otherwise, then sort by that column.