A sample record:
Row(user_id='KxGeqg5ccByhaZfQRI4Nnw', gender='male', year='2015', month='September', day='20',
hour='16', weekday='Sunday', reviewClass='place love back', business_id='S75Lf-Q3bCCckQ3w7mSN2g',
business_name='Notorious Burgers', city='Scottsdale', categories='Nightlife, American (New), Burgers,
Comfort Food, Cocktail Bars, Restaurants, Food, Bars, American (Traditional)', user_funny='1',
review_sentiment='Positive', friend_id='my4q3Sy6Ei45V58N2l8VGw')
This table has more than 100 million records. My SQL query should do the following:
For a particular user visiting a specific business, select the most frequent review_sentiment among that user's friends (friend_id) and the most frequent gender among those friends.
Note that every friend_id is itself a user_id.
Example Scenario:
One user
Has Visited 4 Businesses
Has 10 friends
5 of these friends have visited Businesses 1 and 2, while the other 5 have
visited the 3rd business only, and none have visited the fourth
Now, the 5 friends have more positive than negative sentiments for B1
and more negative than positive sentiments for B2, and the other
5 friends are all negative for B3
I want the following output for this:
**user_id | business_id | friend_common_sentiment | mostCommonGender | .... otherCols**
user_id_1 | business_id_1 | positive | male | .... otherCols
user_id_1 | business_id_2 | negative | female | .... otherCols
user_id_1 | business_id_3 | negative | female | .... otherCols
Here's a simple query I wrote for this in pyspark:
SELECT user_id, gender, year, month, day, hour, weekday, reviewClass, business_id, business_name, city,
categories, user_funny, review_sentiment FROM events1 GROUP BY user_id, friend_id, business_id ORDER BY
COUNT(review_sentiment) DESC LIMIT 1
This query does not give the expected result, and I'm not sure how exactly to fit an INNER JOIN into it.
Man, does that data structure make things hard. But let's break it down into steps:
You need to self join to get the data for friends.
Once you have the data for friends, perform aggregate functions to get counts of each possible value, grouping by the user and the business.
Sub-query the above in order to decide between the values based on the counts.
I'm just going to call your table "tags". Sadly, just like in real life, we can't assume everyone has friends, and since you didn't specify to exclude the forever alone crowd, we need a left join to keep users without friends. The join would be as follows:
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
and friends.business_id = user.business_id
Next you have to figure out what the most common gender/review is for a given user and business combination. This is where the data structure really kicks us in the butt. We could do this in one step with some clever window functions, but I want this answer to be easily understood, so I'm going to use a sub-query and CASE expressions. For the sake of simplicity I'm assuming only two gender values, but you can follow the same pattern for additional values. The string literals have to match the stored values exactly ('male' and 'Positive' in your sample record).
select user.user_id, user.business_id
-- literals match the casing in the sample record (gender='male', review_sentiment='Positive')
, sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
, sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id
Now we just have to grab data from the sub-query and make some decisions. You may want to add some additional options, for instance for the case where there are no friends, or where friends are evenly split between gender/sentiment. It is the same pattern as below, just with extra values to choose from.
select user_id
, business_id
, case when MaleFriends > FemaleFriends then 'male' else 'female' end as MostCommonGender
, case when FriendsPositive > FriendsNegative then 'Positive' else 'Negative' end as MostCommonSentiment
from ( select user.user_id, user.business_id
, sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
, sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id) as a
This gives you the steps to follow, and hopefully a clear explanation on how they work. Good luck!
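The three steps above can be sketched end to end. This is only a sketch under assumptions: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, the rows are made up, and the aliases u, f and counts are mine. It also adds one thing the query above does not do: because the table stores one row per (user, friend) pair, a friend with many friends of their own shows up many times, so the sketch joins against a DISTINCT view of each friend's review. Ties and "no friends" get their own label.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (user_id TEXT, gender TEXT, business_id TEXT, "
    "review_sentiment TEXT, friend_id TEXT)"
)
# Hypothetical data: u1 has friends f1, f2, f3 who all reviewed b1;
# u2 reviewed b2 and has no friends.
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "male", "b1", "Positive", "f1"),
        ("u1", "male", "b1", "Positive", "f2"),
        ("u1", "male", "b1", "Positive", "f3"),
        ("f1", "female", "b1", "Positive", "u1"),
        ("f2", "female", "b1", "Negative", "u1"),
        ("f3", "male", "b1", "Positive", "u1"),
        ("u2", "female", "b2", "Negative", None),
    ],
)

QUERY = """
SELECT user_id,
       business_id,
       CASE WHEN male_friends > female_friends THEN 'male'
            WHEN female_friends > male_friends THEN 'female'
            ELSE 'tie or no friends' END AS most_common_gender,
       CASE WHEN friends_positive > friends_negative THEN 'Positive'
            WHEN friends_negative > friends_positive THEN 'Negative'
            ELSE 'tie or no friends' END AS friend_common_sentiment
FROM (
    -- steps 1 and 2: self join, then count each value per user and business
    SELECT u.user_id,
           u.business_id,
           SUM(CASE WHEN f.gender = 'male' THEN 1 ELSE 0 END) AS male_friends,
           SUM(CASE WHEN f.gender = 'female' THEN 1 ELSE 0 END) AS female_friends,
           SUM(CASE WHEN f.review_sentiment = 'Positive' THEN 1 ELSE 0 END) AS friends_positive,
           SUM(CASE WHEN f.review_sentiment = 'Negative' THEN 1 ELSE 0 END) AS friends_negative
    FROM tags AS u
    LEFT JOIN (
        -- one row per friend review, not one per (friend, friend-of-friend)
        SELECT DISTINCT user_id, business_id, gender, review_sentiment
        FROM tags
    ) AS f
      ON f.user_id = u.friend_id
     AND f.business_id = u.business_id
    GROUP BY u.user_id, u.business_id
) AS counts
ORDER BY user_id, business_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On a 100-million-row table the DISTINCT sub-query is an extra shuffle, so whether it is worth it depends on how often friends' rows really repeat in your data.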
Related
Here is my SQL query:
SELECT
COUNT(CASE WHEN `urgency`='1' THEN 1 END) AS verylow,
COUNT(CASE WHEN `urgency`='2' THEN 1 END) AS low,
COUNT(CASE WHEN `urgency`='3' THEN 1 END) AS standard,
COUNT(CASE WHEN `urgency`='4' THEN 1 END) AS high,
COUNT(CASE WHEN `urgency`='5' THEN 1 END) AS critical,
tbl_users.userName
FROM
notes, tbl_users
WHERE
notes.responsible = tbl_users.userID
AND project_id = '4413'
AND (notes.status = 'Ongoing' OR notes.status = 'Not started')
and the output is:
verylow low standard high critical userName
5 1 2 1 1 Nick
However, this is wrong, because I have multiple users in the database who have assigned tasks. The data looks like this in my database:
urgency userName
3 Nick
5 Nick
4 Nick
3 James
1 James
1 Nick
2 Nick
1 James
1 Nick
1 Nick
Any idea why it doesn't count the urgency for the other user and how many different urgencies he has?
What you are doing is not entirely correct. If you turned on the MySQL mode ONLY_FULL_GROUP_BY, you'd get an error, because you are selecting a column that is not in a GROUP BY clause without applying an aggregation function. So you need a GROUP BY clause.
The entire query should look like so:
SELECT
COUNT(CASE WHEN `urgency`='1' THEN 1 END) AS verylow,
COUNT(CASE WHEN `urgency`='2' THEN 1 END) AS low,
COUNT(CASE WHEN `urgency`='3' THEN 1 END) AS standard,
COUNT(CASE WHEN `urgency`='4' THEN 1 END) AS high,
COUNT(CASE WHEN `urgency`='5' THEN 1 END) AS critical,
tbl_users.userName
FROM
notes, tbl_users
WHERE
notes.responsible = tbl_users.userID
AND project_id = '4413'
AND (notes.status = 'Ongoing' OR notes.status = 'Not started')
GROUP BY tbl_users.userName;
With COUNT you aggregate your rows. As there is no GROUP BY clause, you aggregate them to one row (rather than, say a row per user).
You are selecting userName. Which? As you select only one row, the DBMS picks one arbitrarily.
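To see the difference on the ten rows from the question, here is a small sketch. It assumes Python's built-in sqlite3 instead of MySQL, and it flattens the notes/tbl_users join into a single table with just urgency and userName, so it only illustrates the grouping behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the joined notes/tbl_users result in the question.
conn.execute("CREATE TABLE notes (urgency INTEGER, userName TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        (3, "Nick"), (5, "Nick"), (4, "Nick"), (3, "James"), (1, "James"),
        (1, "Nick"), (2, "Nick"), (1, "James"), (1, "Nick"), (1, "Nick"),
    ],
)

COUNTS = """
    COUNT(CASE WHEN urgency = 1 THEN 1 END) AS verylow,
    COUNT(CASE WHEN urgency = 2 THEN 1 END) AS low,
    COUNT(CASE WHEN urgency = 3 THEN 1 END) AS standard,
    COUNT(CASE WHEN urgency = 4 THEN 1 END) AS high,
    COUNT(CASE WHEN urgency = 5 THEN 1 END) AS critical
"""

# No GROUP BY: every row is aggregated into a single result row.
overall = conn.execute(f"SELECT {COUNTS} FROM notes").fetchall()

# GROUP BY userName: one result row per user.
per_user = conn.execute(
    f"SELECT userName, {COUNTS} FROM notes GROUP BY userName ORDER BY userName"
).fetchall()

print(overall)   # [(5, 1, 2, 1, 1)]
print(per_user)  # [('James', 2, 0, 1, 0, 0), ('Nick', 3, 1, 1, 1, 1)]
```

The single row without GROUP BY has the same counts as the output in the question, which is the total over both users.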
I am not entirely sure even how to name this post, because I do not know exactly how to ask it.
I have three tables. One with users, one with foods and one with the users rating of the foods, like such (simplified example):
Foods
id name type
---------------------
1 Apple fruit
2 Banana fruit
3 Steak meat
Users
id username
-----------------
1 Mark
2 Harrison
3 Carrie
Scores (fid = food id, uid = user id)
fid uid score
---------------------
1 1 3
1 2 5
2 1 2
3 2 3
Now, I have this query, which works perfectly:
SELECT fn.name as Food, ROUND(AVG(s.score),1) AS AvgScore FROM Foods fn LEFT JOIN Scores s ON fn.id = s.fid GROUP BY fn.id ORDER BY fn.name ASC
As you can tell, it lists all names from Foods including an average from all users ratings of the food.
I also want to add the logged-in user's own score. (Assume that when Mark is logged in, his uid is set in a session variable or whatever.)
I need the following output, if you are logged in as Mark:
Food AvgScore Your Score
Apple 4 3
I have made several attempts to make this happen, but I cannot find the solution. I have learned that if you have a question, it is very likely that someone else has asked it before you do, but I do not quite know how to phrase the question, so I get no answers when googling. A pointer in the right direction would be much appreciated.
You can try with case:
SELECT fn.name as Food,
ROUND(AVG(s.score),1) AS AvgScore,
sum(case when s.uid = $uid then s.score else 0 end) as YourScore
FROM Foods fn
LEFT JOIN Scores s ON fn.id = s.fid
GROUP BY fn.id
ORDER BY fn.name ASC
$uid is a variable, of course.
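A runnable sketch of the same idea on the sample tables, assuming Python's built-in sqlite3 rather than MySQL. Two deliberate differences from the query above, both my own choices: the uid is passed as a bound parameter instead of being pasted into the SQL string, and MAX(CASE ... END) without an ELSE is used so that a food the user has not rated comes back as NULL rather than 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foods (id INTEGER, name TEXT, type TEXT)")
conn.execute("CREATE TABLE Scores (fid INTEGER, uid INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO Foods VALUES (?, ?, ?)",
    [(1, "Apple", "fruit"), (2, "Banana", "fruit"), (3, "Steak", "meat")],
)
conn.executemany(
    "INSERT INTO Scores VALUES (?, ?, ?)",
    [(1, 1, 3), (1, 2, 5), (2, 1, 2), (3, 2, 3)],
)

QUERY = """
SELECT fn.name AS Food,
       ROUND(AVG(s.score), 1) AS AvgScore,
       -- only the logged-in user's row contributes; NULL if they never rated it
       MAX(CASE WHEN s.uid = ? THEN s.score END) AS YourScore
FROM Foods fn
LEFT JOIN Scores s ON fn.id = s.fid
GROUP BY fn.id, fn.name
ORDER BY fn.name ASC
"""

mark_uid = 1  # Mark's id in the Users table from the question
rows = conn.execute(QUERY, (mark_uid,)).fetchall()
for row in rows:
    print(row)
# ('Apple', 4.0, 3)
# ('Banana', 2.0, 2)
# ('Steak', 3.0, None)
```

If you prefer 0 for unrated foods, the SUM ... ELSE 0 form does that instead.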
I have not written MySQL for a very long time and I can't get my head around why this is not working! I have written the following query, which I hoped would let me see crop yield per year.
I have two tables. The first, called "growseason", records how many plants of each variety were grown, with the following fields:
id
username
variety
datestamp
plants
My other table, called "harvest", gets an entry whenever a user adds a harvest to the database, with the following fields:
id
datestamp
username
variety
picked
weight
I am trying to create a table that shows the year-on-year crop per plant. This will give me an indication of whether the crop is better or worse than the previous year.
SELECT g.Variety,
ROUND(SUM(IF(YEAR(h.datestamp)=YEAR(CURRENT_DATE),h.picked,0)) /
IF(YEAR(g.datestamp)=YEAR(CURRENT_DATE),g.plants,0),0) As FruitPerPlantThisYear,
ROUND(SUM(IF(YEAR(h.datestamp)=YEAR(CURRENT_DATE)-1,h.picked,0)) /
IF(YEAR(g.datestamp)=YEAR(CURRENT_DATE)-1,g.plants,0),0) As FruitPerPlantLastYear
FROM harvest h
LEFT JOIN growseason g ON h.variety = g.variety AND YEAR(h.datestamp) = YEAR(g.datestamp) AND h.username = g.username
WHERE g.username = 'Palendrone' AND picked <> '0'
GROUP BY variety, g.datestamp
Expected output:
Variety | FruitPerPlantThisYear | FruitPerPlantLastYear
-------------------------------------------------------
Var1 | 34 | 31
Var2 | 112 | 123
Var3 | 67 | 41
Actual output:
Variety | FruitPerPlantThisYear | FruitPerPlantLastYear
-------------------------------------------------------
Var1 | 34 |
Var2 | | 123
Var3 | | 41
I understand that the g.datestamp in my GROUP BY duplicates the variety names, but if I don't add it I only get a single instance (this year or last year). Having spent hours trying to solve this, I am now all out of ideas.
I give in and accept help, please! I'm also not sure how I could structure this any better...
I think you are looking for conditional aggregation: sum the picked fruit and the plants separately for each year, and divide the two sums.
SELECT g.Variety,
sum(case when YEAR(h.datestamp)=YEAR(CURRENT_DATE) then h.picked else 0 end) /
sum(case when YEAR(g.datestamp)=YEAR(CURRENT_DATE) then g.plants else 0 end) as fruitPerPlantThisYear,
sum(case when YEAR(h.datestamp)=YEAR(CURRENT_DATE) -1 then h.picked else 0 end) /
sum(case when YEAR(g.datestamp)=YEAR(CURRENT_DATE) -1 then g.plants else 0 end) as fruitPerPlantLastYear
FROM harvest h
LEFT JOIN growseason g ON h.variety = g.variety AND h.username = g.username
WHERE g.username = 'Palendrone' AND picked <> '0'
GROUP BY g.variety
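One thing to watch with a row-level join like this: every growseason row is repeated once per matching harvest row (and the other way round), so the two sums can be inflated by different factors. A way around that, sketched here under assumptions, is to aggregate each table to one row per user, variety and year first and only then join. The sketch uses Python's built-in sqlite3, so strftime('%Y', ...) stands in for MySQL's YEAR(), the two years are passed in as parameters instead of using CURRENT_DATE, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE growseason (username TEXT, variety TEXT, datestamp TEXT, plants INTEGER)"
)
conn.execute(
    "CREATE TABLE harvest (datestamp TEXT, username TEXT, variety TEXT, picked INTEGER)"
)
# Hypothetical data: 10 plants in 2023, 5 plants in 2024, two harvests per year.
conn.executemany(
    "INSERT INTO growseason VALUES (?, ?, ?, ?)",
    [("Palendrone", "Var1", "2023-03-01", 10), ("Palendrone", "Var1", "2024-03-01", 5)],
)
conn.executemany(
    "INSERT INTO harvest VALUES (?, ?, ?, ?)",
    [
        ("2023-07-01", "Palendrone", "Var1", 100),
        ("2023-08-01", "Palendrone", "Var1", 210),
        ("2024-07-01", "Palendrone", "Var1", 70),
        ("2024-08-01", "Palendrone", "Var1", 100),
    ],
)

QUERY = """
SELECT g.variety,
       ROUND(SUM(CASE WHEN g.yr = :this_year THEN 1.0 * h.picked / g.plants END), 0)
           AS FruitPerPlantThisYear,
       ROUND(SUM(CASE WHEN g.yr = :last_year THEN 1.0 * h.picked / g.plants END), 0)
           AS FruitPerPlantLastYear
FROM (
    -- one row per user, variety and year: total plants
    SELECT username, variety, strftime('%Y', datestamp) AS yr, SUM(plants) AS plants
    FROM growseason
    GROUP BY username, variety, strftime('%Y', datestamp)
) AS g
JOIN (
    -- one row per user, variety and year: total fruit picked
    SELECT username, variety, strftime('%Y', datestamp) AS yr, SUM(picked) AS picked
    FROM harvest
    GROUP BY username, variety, strftime('%Y', datestamp)
) AS h
  ON h.username = g.username AND h.variety = g.variety AND h.yr = g.yr
WHERE g.username = :username
GROUP BY g.variety
ORDER BY g.variety
"""

rows = conn.execute(
    QUERY, {"this_year": "2024", "last_year": "2023", "username": "Palendrone"}
).fetchall()
print(rows)  # [('Var1', 34.0, 31.0)]
```

Because each sub-query already has one row per variety and year, the outer GROUP BY only has to fold the two years of a variety into one output row.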
I am mostly new to SQL and ran across a situation that I can't figure out. Say that I have 2 tables: P and A.
Person id Live Income
--------- -- ---- ------
Tom 1 House 10
Sarah 2 Apt 7
Sterling 3 Playpen 0
Chris 4 House 6
Juanita 5 Apt 12
...
Live2 id Attribute
--------- -- -----
House 1 Job
House 2 Car
House 3 Kids
Apt 4 Job
Apt 5 Car
Playpen 6 Diapers
So if you have a 'House' then you always also have a Job, Car, and Kids. If you have a Playpen then you only have Diapers (and never a Job).
What I am trying to do (without double-counting people) is find the total income for 'House' people (Live='House', the 1st category), then 'Job' people (Attribute='Job', 2nd category), then 'Diaper' people, etc. So Tom is counted as a 'House' person but not a 'Job' person (because he has been previously classified and I don't want to double-count income).
Logically I can think of several ways to approach and based on my research this seems to be a perfect place to use the long form of CASE because I can specify conditions from different columns. BUT, I can't seem to join the tables in a way that I don't end up double counting income by creating too many rows. For example, I'll JOIN them and it will create 3 Tom entries (one each for Job, Car, Kids).
IMO either we need multiple 'Attribute' columns (one each for Job, Car, Kids, Diapers) so 'Tom' is still fully described on one row or some way to ignore all the other 'Tom' rows once he has been counted in a classification.
Without knowing some additional details, I'm guessing this is what you want. Table a is the one with the person in it; table b has live2 in it.
SELECT
SUM(CASE WHEN live = 'House' THEN income ELSE 0 END) as house,
SUM(CASE WHEN live = 'Apt' THEN income ELSE 0 END) as apt,
SUM(CASE WHEN live = 'Playpen' THEN income ELSE 0 END) as playpen
FROM
( SELECT a.*
FROM a
JOIN b ON b.live2 = a.live
GROUP BY a.id
)t
This assumes that, besides House, Apt is the only other Live value that can have a Job. If that's the case, this query will do what you want.
If you want to actually specify 'Job' in the query then you can do it like this.
SELECT
SUM(CASE WHEN live = 'House' THEN income ELSE 0 END) as house,
SUM(CASE WHEN attribute = 'Job' AND live2 <> 'House' THEN income ELSE 0 END) as apt,
SUM(CASE WHEN live = 'Playpen' THEN income ELSE 0 END) as playpen
FROM
( SELECT a.*, b.live2, b.attribute
FROM a
JOIN b ON b.live2 = a.live
GROUP BY a.id
)t
You could also join with each field specified. If you want an example, I can show you.
Your question is a bit strange, I'll agree. There isn't any complication, since none of your persons lives in both a house and an apartment...
select live,
sum(income) income,
count(*) people
from p
left join
a
on p.live = a.live2
and a.attribute = 'job'
group by live
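If the goal is to put each person into exactly one bucket in priority order (House first, then Job, then Diapers), one option is a searched CASE that looks up attributes with EXISTS, so the person table is never multiplied by the join. This is a sketch under assumptions: it uses Python's built-in sqlite3, the table names p and a from the question, and the column name person and the 'Other' fallback are my own additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (person TEXT, id INTEGER, live TEXT, income INTEGER)")
conn.execute("CREATE TABLE a (live2 TEXT, id INTEGER, attribute TEXT)")
conn.executemany(
    "INSERT INTO p VALUES (?, ?, ?, ?)",
    [
        ("Tom", 1, "House", 10), ("Sarah", 2, "Apt", 7), ("Sterling", 3, "Playpen", 0),
        ("Chris", 4, "House", 6), ("Juanita", 5, "Apt", 12),
    ],
)
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [
        ("House", 1, "Job"), ("House", 2, "Car"), ("House", 3, "Kids"),
        ("Apt", 4, "Job"), ("Apt", 5, "Car"), ("Playpen", 6, "Diapers"),
    ],
)

QUERY = """
SELECT category, SUM(income) AS total_income, COUNT(*) AS people
FROM (
    SELECT p.income,
           -- the first matching WHEN wins, so nobody is counted twice
           CASE
               WHEN p.live = 'House' THEN 'House'
               WHEN EXISTS (SELECT 1 FROM a
                            WHERE a.live2 = p.live AND a.attribute = 'Job') THEN 'Job'
               WHEN EXISTS (SELECT 1 FROM a
                            WHERE a.live2 = p.live AND a.attribute = 'Diapers') THEN 'Diapers'
               ELSE 'Other'
           END AS category
    FROM p
) AS classified
GROUP BY category
ORDER BY category
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Diapers', 0, 1), ('House', 16, 2), ('Job', 19, 2)]
```

Tom and Chris land in House even though a house also implies a Job, because the House branch is checked first.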
I have a three table setup : kids, toys and games, each with unique primary keys : id_kid, id_toy and id_game. Each kid can have multiple toys and games, but each toy or game is owned by only one kid.
The toys and games have a bought column with 3 states : -1,0,1
The table structure is something like this :
kids
id_kid
kid_name
etc
games
id_game
id_kid_games --> links with id_kid in kids_table (maybe not the best name, I know)
game_name
bought --> can be -1,0,1
toys
id_toy
id_kid_toys --> links with id_kid in kids_table
toy_name
bought --> can be -1,0,1
For each kid I'm trying to get a total of toys and games, bought and not bought, using the query below. However, the results are doubled:
SELECT k.*,
COUNT(DISTINCT t.id_toy) AS total_toys,
SUM(CASE t.bought WHEN 1 THEN 1 ELSE 0 END) AS toys_bought,
SUM(CASE t.bought WHEN -1 THEN 1 ELSE 0 END) AS toys_not_bought,
COUNT(DISTINCT g.id_game) AS total_games,
SUM(CASE g.bought WHEN 1 THEN 1 ELSE 0 END) AS games_bought,
SUM(CASE g.bought WHEN -1 THEN 1 ELSE 0 END) AS games_not_bought
FROM kids AS k
LEFT JOIN toys t ON k.id_kid = t.id_kid_toys
LEFT JOIN games g ON k.id_kid = g.id_kid_games
GROUP BY k.id_kid
ORDER BY k.kid_name ASC
One kid has 2 toys and 4 games, all bought, and the results are 2 total toys (correct), 4 total games (correct), 8 toys bought, 8 games bought. (both wrong)
Please help with an answer - if possible - without using subselects.
Thank you.
As you are selecting data from two unrelated relations (kids joined to toys, and kids joined to games), subqueries are the natural way of doing it. As uncorrelated subqueries may be used, this should not be particularly slow.
Try whether this query is sufficiently efficient. Compared to your original query, it basically just reverses the order of joining and grouping:
SELECT kids.*, t.total_toys, t.toys_bought, t.toys_not_bought,
g.total_games, g.games_bought, g.games_not_bought
FROM kids
LEFT JOIN (SELECT id_kid_toys,
COUNT(*) AS total_toys,
SUM(CASE bought WHEN 1 THEN 1 ELSE 0 END) as toys_bought,
SUM(CASE bought WHEN -1 THEN 1 ELSE 0 END) as toys_not_bought
FROM toys
GROUP BY id_kid_toys) AS t
ON t.id_kid_toys = kids.id_kid
LEFT JOIN (SELECT id_kid_games,
COUNT(*) AS total_games,
SUM(CASE bought WHEN 1 THEN 1 ELSE 0 END) as games_bought,
SUM(CASE bought WHEN -1 THEN 1 ELSE 0 END) as games_not_bought
FROM games
GROUP BY id_kid_games) AS g
ON g.id_kid_games = kids.id_kid
ORDER by kids.kid_name;
If you insist on avoiding subqueries, this, probably far less efficient, query might do:
SELECT k.*,
COUNT(DISTINCT t.id_toy) AS total_toys,
-- sum only toys joined to first game
SUM(IF(g2.id_game IS NULL AND t.bought = 1, 1, 0)) AS toys_bought,
SUM(IF(g2.id_game IS NULL AND t.bought = -1, 1, 0)) AS toys_not_bought,
-- sum only games joined to first toy
COUNT(DISTINCT g.id_game) AS total_games,
SUM(IF(t2.id_toy IS NULL AND g.bought = 1, 1, 0)) AS games_bought,
SUM(IF(t2.id_toy IS NULL AND g.bought = -1, 1, 0)) AS games_not_bought
FROM kids AS k
LEFT JOIN toys t ON k.id_kid = t.id_kid_toys
LEFT JOIN games g ON k.id_kid = g.id_kid_games
-- select only rows where either game or toy is the first one for this kid
LEFT JOIN toys t2 ON t2.id_kid_toys = k.id_kid AND t2.id_toy < t.id_toy
LEFT JOIN games g2 ON g2.id_kid_games = k.id_kid AND g2.id_game < g.id_game
WHERE t2.id_toy IS NULL OR g2.id_game IS NULL
GROUP BY k.id_kid
ORDER BY k.kid_name ASC
It is meant to work by ensuring that, for each kid, only the games joined to the first toy are counted, and only the toys joined to the first game are counted.
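Another subquery-free option, offered as a sketch rather than a drop-in replacement: keep the two plain LEFT JOINs and count distinct ids inside a CASE, so a toy repeated once per game (or a game repeated once per toy) is still counted once. That also sidesteps a weakness of the "first toy / first game" self-joins, where a later toy or game matches several earlier ones and can still be repeated when a kid has three or more of them. The sketch assumes Python's built-in sqlite3 in place of MySQL, and the kid names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kids (id_kid INTEGER PRIMARY KEY, kid_name TEXT)")
conn.execute(
    "CREATE TABLE toys (id_toy INTEGER PRIMARY KEY, id_kid_toys INTEGER, "
    "toy_name TEXT, bought INTEGER)"
)
conn.execute(
    "CREATE TABLE games (id_game INTEGER PRIMARY KEY, id_kid_games INTEGER, "
    "game_name TEXT, bought INTEGER)"
)
# Hypothetical data: Ann has 2 toys and 4 games, all bought (the case from the
# question); Bob has 3 toys in the three different states and no games.
conn.executemany("INSERT INTO kids VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO toys VALUES (?, ?, ?, ?)",
    [(1, 1, "ball", 1), (2, 1, "kite", 1),
     (3, 2, "car", 1), (4, 2, "robot", -1), (5, 2, "drum", 0)],
)
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?)",
    [(1, 1, "chess", 1), (2, 1, "go", 1), (3, 1, "ludo", 1), (4, 1, "uno", 1)],
)

QUERY = """
SELECT k.id_kid,
       k.kid_name,
       COUNT(DISTINCT t.id_toy) AS total_toys,
       -- the CASE yields the id only for matching rows; DISTINCT removes join repeats
       COUNT(DISTINCT CASE WHEN t.bought = 1 THEN t.id_toy END) AS toys_bought,
       COUNT(DISTINCT CASE WHEN t.bought = -1 THEN t.id_toy END) AS toys_not_bought,
       COUNT(DISTINCT g.id_game) AS total_games,
       COUNT(DISTINCT CASE WHEN g.bought = 1 THEN g.id_game END) AS games_bought,
       COUNT(DISTINCT CASE WHEN g.bought = -1 THEN g.id_game END) AS games_not_bought
FROM kids AS k
LEFT JOIN toys AS t ON k.id_kid = t.id_kid_toys
LEFT JOIN games AS g ON k.id_kid = g.id_kid_games
GROUP BY k.id_kid, k.kid_name
ORDER BY k.kid_name ASC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
# (1, 'Ann', 2, 2, 0, 4, 4, 0)
# (2, 'Bob', 3, 1, 1, 0, 0, 0)
```

The join still builds toys times games rows per kid before grouping, so for kids with many of both the pre-aggregated subquery version is likely to be cheaper.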