I'm doing some statistics based on a database of states. I would like to output the rank of a state and its percentage as compared to the other states (i.e. state X's value is higher than 55% of the other states' values).
I'm trying something like this:
SELECT
count(*) AS TotalStates,
(SELECT COUNT(*) FROM states) AS NumberStates,
(TotalStates/NumStates) AS percentage
FROM states
WHERE CRITERIA > 7.5
I'm getting an SQL error, TotalStates (my derived value) is not found. How can I get all three of these values returned with one query?
You can put the main calculations in a subselect, then reference the aliased columns in the outer query, both to pull the already calculated values and to obtain another one from them:
SELECT
TotalStates,
NumberStates,
TotalStates / NumberStates AS percentage
FROM (
SELECT
COUNT(*) AS TotalStates,
(SELECT COUNT(*) FROM states) AS NumberStates
FROM states
WHERE CRITERIA > 7.5
) s
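The derived-table pattern above can be tried end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table contents and criteria values are made up for illustration. One difference to be aware of: SQLite performs integer division on two counts, so the sketch multiplies by 1.0, which MySQL's / operator does not need.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, criteria REAL)")
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [("A", 9.1), ("B", 8.0), ("C", 7.5), ("D", 3.2)],  # hypothetical rows
)

# The inner query computes both counts; the outer query can then
# reference the aliases and derive the percentage from them.
row = conn.execute(
    """
    SELECT TotalStates,
           NumberStates,
           1.0 * TotalStates / NumberStates AS percentage
    FROM (
        SELECT COUNT(*) AS TotalStates,
               (SELECT COUNT(*) FROM states) AS NumberStates
        FROM states
        WHERE criteria > 7.5
    ) s
    """
).fetchone()

print(row)
```

Two of the four sample rows are strictly above 7.5, so the percentage comes out as one half.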
The error you are getting comes from the fact that you are trying to use the derived value in the same select clause that you are creating it in. You will need to repeat the expressions instead, along these lines:
SELECT count(*) as TotalStates,
(SELECT count(*) from states) as NumberStates,
(count(*)/(SELECT count(*) from states)) as percentage
FROM states
WHERE criteria = x
However, this is not very efficient or desirable for readability or maintainability. Is there a design reason that you cannot perform this in two queries, or better yet, get the two data items in separate queries and calculate the percentage in the consuming code?
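A minimal sketch of the "calculate it in the consuming code" option, assuming a Python client and using the built-in sqlite3 module in place of MySQL. The table contents and the threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, criteria REAL)")
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [("A", 9.1), ("B", 8.0), ("C", 7.5), ("D", 3.2), ("E", 1.0)],  # made-up data
)

threshold = 7.5  # hypothetical cut-off

# Query 1: how many states exceed the threshold.
matching = conn.execute(
    "SELECT COUNT(*) FROM states WHERE criteria > ?", (threshold,)
).fetchone()[0]

# Query 2: how many states there are in total.
total = conn.execute("SELECT COUNT(*) FROM states").fetchone()[0]

# The percentage is derived in application code; guard against an empty table.
percentage = matching / total if total else 0.0
print(matching, total, percentage)
```

Each query stays trivial to read, at the cost of a second round trip to the database.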
Sales :
Q1) Return the name of the agent who had the highest increase in sales compared to the previous year
A) Initially I wrote the following query
Select name, (sales_2018-sales_2017) as increase
from sales
where increase= (select max(sales_2018-sales_2017)
from sales)
I got an error saying I cannot use increase with the keyword where because "increase" is not a column but an alias
So I changed the query to the following :
Select name, (sales_2018-sales_2017) as increase
from sales
where (sales_2018-sales_2017)= (select max(sales_2018-sales_2017)
from sales)
This query did work, but I feel there should be a better way to write it, i.e. instead of writing where (sales_2018-sales_2017)= (select max(sales_2018-sales_2017) from sales). So I was wondering if there is a workaround for using an alias with where.
Q2) Suppose the table is as follows, and we are asked to return the EmpId and name of employees who got rating A for 3 consecutive years:
I wrote the following query, and it is working:
select id,name
from ratings
where rating_2017='A' and rating_2018='A' and rating_2019='A'
Chaining 3 columns (rating_2017, rating_2018, rating_2019) with AND is easy. I want to know if there is a better way to chain columns with AND when, say, we want to find an employee who has rating 'A' for 10 consecutive years.
Q3) Last but not the least, I'm really interested in learning to write intermediate-complex SQL queries and take my sql skills to next level. Is there a website out there that can help me in this regard ?
1) A column alias cannot be referenced in the WHERE clause of the same query level, because WHERE is evaluated before the select list. You therefore need to define the expression first (using either an inline view or a CTE for increase); after that you can refer to it in the outer query.
Eg:
select *
from ( select name, (sales_2018-sales_2017) as increase
from sales
)x
where x.increase= (select max(sales_2018-sales_2017)
from sales)
Another option would be to use analytic (window) functions to get your desired results, if you are on MySQL 8.0 or later:
select *
from ( select name
,(sales_2018-sales_2017) as increase
,max(sales_2018-sales_2017) over() as max_increase
from sales
)x
where x.increase=x.max_increase
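To see the window-function variant run, here is a small sketch using Python's built-in sqlite3 module (SQLite 3.25 or later is assumed for window functions) as a stand-in for MySQL 8.0. The agent names and figures are invented, and two agents are deliberately tied so that both come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (name TEXT, sales_2017 INTEGER, sales_2018 INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Ann", 100, 150), ("Bob", 200, 250), ("Cy", 300, 310)],  # made-up rows
)

# MAX(...) OVER () attaches the overall maximum to every row, so the outer
# query can compare each row's increase with it through the aliases.
top = conn.execute(
    """
    SELECT name, increase
    FROM (
        SELECT name,
               sales_2018 - sales_2017 AS increase,
               MAX(sales_2018 - sales_2017) OVER () AS max_increase
        FROM sales
    ) x
    WHERE increase = max_increase
    ORDER BY name
    """
).fetchall()

print(top)
```

Because the filter compares against the maximum rather than taking the first row, every agent sharing the top increase is returned.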
Q2) There are alternative ways to write this, but the basic issue is the table design, where you are storing each rating year as a new column. Had each year been a row, it would have been easier.
Here is another way
select id,name
from ratings
where length(concat(rating_2017,rating_2018,rating_2019))-
length(replace(concat(rating_2017,rating_2018,rating_2019),'A',''))=3
Q3) Check out some example problems on HackerRank or https://msbiskills.com/tsql-puzzles-asked-in-interview-over-the-years/. You can also search the questions and answers on Stack Overflow to find solutions to tough problems people have faced.
Q1 : you can simply order and limit the query results (hence no subquery is necessary) ; also, column aliases are allowed in the ORDER BY clause
SELECT
name,
sales_2018-sales_2017 as increase
FROM sales
ORDER BY increase DESC
LIMIT 1
Q2 : your query is fine ; other options exist, but they will not make it faster or easier to maintain.
Finally, please note that your best option overall would be to modify your database layout : you want to have yearly data in rows, not in columns ; there should be only one column to store the year instead of several. That would make your queries simpler to write and to maintain (and you wouldn’t need to create a new column every new year...)
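As a rough illustration of the row-per-year layout, the sketch below stores one rating per employee per year and finds employees rated 'A' in every year of a window. It runs on Python's built-in sqlite3 module rather than MySQL, the table and column names (ratings, emp_id, year, rating) are hypothetical, and it assumes at most one row per employee per year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (emp_id INTEGER, name TEXT, year INTEGER, rating TEXT)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", 2017, "A"), (1, "Ann", 2018, "A"), (1, "Ann", 2019, "A"),
        (2, "Bob", 2017, "A"), (2, "Bob", 2018, "B"), (2, "Bob", 2019, "A"),
    ],  # made-up data
)

first_year, last_year = 2017, 2019

# Keep only 'A' rows inside the window, then require one such row per year.
# Widening the window to 10 years only changes the two parameters.
all_a = conn.execute(
    """
    SELECT emp_id, name
    FROM ratings
    WHERE year BETWEEN ? AND ?
      AND rating = 'A'
    GROUP BY emp_id, name
    HAVING COUNT(*) = ?
    """,
    (first_year, last_year, last_year - first_year + 1),
).fetchall()

print(all_a)
```

With this layout the query text stays the same no matter how many years are checked, and no new column is needed each year.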
I've been trying to learn MySQL, and I'm having some trouble creating a join query to not select duplicates.
Basically, here's where I'm at :
SELECT atable.phonenumber, btable.date
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
However, in my database, there is the possibility of having duplicate rows in column atable.phonenumber.
For example (added asterisks for clarity)
phonenumber | date
-------------|-----------
*555-681-2105 | 2015-08-12
555-425-5161 | 2015-08-15
331-484-7784 | 2015-08-17
*555-681-2105 | 2015-08-25
.. and so on.
I tried using SELECT DISTINCT but that doesn't work. I also was looking through other solutions which recommended GROUP BY, but that threw an error, most likely because of my WHERE clause and condition. Not really sure how I can easily accomplish this.
DISTINCT applies to the whole row being returned, essentially saying "I want only unique rows" - any row value may participate in making the row unique
You are getting phone numbers duplicated because you're only looking at the column in isolation. The database is looking at phone number and also date. The rows you posted have different dates, and these hence cause the rows to be different
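A quick way to see this behaviour is to run both forms of DISTINCT on a tiny data set. The sketch below uses Python's built-in sqlite3 module instead of MySQL and, as a simplifying assumption, puts the joined result into a single hypothetical table named calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (phonenumber TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("555-681-2105", "2015-08-12"),
        ("555-425-5161", "2015-08-15"),
        ("555-681-2105", "2015-08-25"),  # same number, different date
    ],
)

# DISTINCT over (phonenumber, date): the repeated number survives because
# the two rows differ in their date.
pairs = conn.execute("SELECT DISTINCT phonenumber, date FROM calls").fetchall()

# DISTINCT over phonenumber alone: the repeated number collapses.
numbers = conn.execute("SELECT DISTINCT phonenumber FROM calls").fetchall()

print(len(pairs), len(numbers))
```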
I suggest you do as the commenter recommended and decide what you want to do with the dates. If you want the latest date for a phone number, do this:
SELECT atable.phonenumber, max(btable.date)
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
When you write a query that uses grouping, you will get a set of rows where there is only one set of value combinations for anything that is in the group by list. In this case, only unique phone numbers. But, because you want other values as well (i.e. the date) you MUST use what's called an aggregate function, to specify what you want to do with all the various values that aren't part of the unique set. Sometimes it will be MAX or MIN, sometimes it will be SUM, COUNT, AVG and so on.
If you're familiar with hash tables or dictionaries from elsewhere in programming, this is what a group by is: it maps a set of values (a key) to a list of rows that have those key values, and then the aggregating function is applied to the values in the list associated with each key.
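The dictionary analogy can be written out directly. This is a plain Python sketch of the idea, not how any database engine is actually implemented; the rows are the sample values from the question.

```python
from collections import defaultdict

rows = [
    ("555-681-2105", "2015-08-12"),
    ("555-425-5161", "2015-08-15"),
    ("331-484-7784", "2015-08-17"),
    ("555-681-2105", "2015-08-25"),
]

# "GROUP BY phonenumber": map each key to the list of values that share it.
groups = defaultdict(list)
for phonenumber, date in rows:
    groups[phonenumber].append(date)

# "MAX(date)": apply the aggregate to each key's list.
# ISO-formatted date strings sort chronologically, so max() picks the latest.
latest = {phonenumber: max(dates) for phonenumber, dates in groups.items()}

for phonenumber, date in latest.items():
    print(phonenumber, date)
```

Each key appears once in the output, which is why the grouped query returns one row per phone number.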
The simple rule when using group by is to write queries thus:
SELECT
List,
of,
columns,
you,
want,
in,
unique,
combination,
FN(List),
FN(of),
FN(columns),
FN(you),
FN(want),
FN(aggregating)
FROM table
GROUP BY
List,
of,
columns,
you,
want,
in,
unique,
combination
i.e. you can copy and paste the non-aggregated columns from your select list to your group list. MySQL does not infer this list for you: if you use one or more aggregate functions like max in your select list but omit the group by clause, the whole result set is treated as a single group and you get one row back. With ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5), any non-aggregated column in such a query is rejected with an error; with it disabled, MySQL returns a value from an arbitrary row for that column. Whether the group by list is largely redundant (since it can usually be derived from the select list) is often debated, but there do exist other things you can do with a group by, such as rollup, cube and grouping sets. Also you can group on a column, if that column is used in a deterministic function, without having to group on the result of the deterministic function. Whether there is any point to doing so is a debate for another time :)
You should add GROUP BY, and an aggregate to the date field, something like this:
SELECT atable.phonenumber, MAX(btable.date)
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
This will return the maximum date, that is, the latest date...
My question is: Why do the following two SQL statements produce different results? (I explain both afterwards. Tested with MariaDB bundled with XAMPP 7.0.8.)
(1)
SELECT
stock_exchange_code,
summed
FROM (
SELECT
stock_exchange_code,
summed
FROM (
SELECT
STOCK_EXCHANGE_CODE,
sum(SHARE_PRICE * SHARE_CNT) AS summed
FROM LISTED_AT
WHERE DATE_VALID = STR_TO_DATE('04-12-2015', '%d-%m-%Y')
GROUP BY STOCK_EXCHANGE_CODE
) a) b
HAVING summed > avg(summed)
(2)
SELECT
stock_exchange_code,
summed
FROM (
SELECT
STOCK_EXCHANGE_CODE,
sum(SHARE_PRICE * SHARE_CNT) AS summed
FROM LISTED_AT
WHERE DATE_VALID = STR_TO_DATE('04-12-2015', '%d-%m-%Y')
GROUP BY STOCK_EXCHANGE_CODE
) a
WHERE summed > (SELECT avg(a.summed)
FROM (SELECT
sum(SHARE_PRICE * SHARE_CNT) AS summed
FROM LISTED_AT
WHERE DATE_VALID = STR_TO_DATE('04-12-2015', '%d-%m-%Y')
GROUP BY STOCK_EXCHANGE_CODE) a)
Result of those queries:
(1) will give you an empty set (I do not understand why)
(2) will give you 2 rows, which is the correct answer
Explanation of the 2 Select statements:
SELECT
STOCK_EXCHANGE_CODE,
sum(SHARE_PRICE * SHARE_CNT) AS summed
FROM LISTED_AT
WHERE DATE_VALID = STR_TO_DATE('04-12-2015', '%d-%m-%Y')
GROUP BY STOCK_EXCHANGE_CODE
This is the part of the Select statement, which sums up all share values at a specific stock exchange.
The output is:
BRX 122653.50
L&S 275000.00
MXK 500000.00
STU 140415.00
XETRA 254610.00
And AVG(summed) = 258535.6
With statement (1) (which I tried first) I wrap a select around the query to be certain that the group by is global. Looking at it now, there is one unnecessary "select all columns by name" level, but this should not matter here. With the outer select I try to apply the "HAVING" clause.
I want all stock exchanges whose summed value ( => "summed") on a specific day is above average. As far as I understand HAVING, it should calculate the global average ( => of the 5 stock exchanges above) and check against that.
I do not know, why this is not working. Changing the summed > avg(summed) to summed <> avg(summed) results in one row ( BRX 122653.50).
summed > 0 results in all 5 rows returned.
This is the reason why I think the average does not work with the having and not the other way round.
(2)
This is much the same as the first, replacing the HAVING clause with a more explicit average calculation. As you can see there are 2 subqueries with the name "a" and both are the same (the second one lacks the stock_exchange_code field). Practically this query is identical to the first one, but with worse code quality (duplication).
My question is: For me the 2 queries should have an identical result. Why do they have a different result?
tl;dr
Average or having clause does not seem to work in MySQL (MariaDB). Why do the 2 SQL statements from the beginning not return the same?
Using an aggregate function without a GROUP BY makes the query treat the entire result set as a single group, which collapses it into one row. That is, HAVING summed > avg(summed) is evaluated against that one collapsed row, not against each of the five rows. Hence, #1 does not do what you intended.
In the second query, spelling out the avg(summed) as SELECT ... is generating one value that is then used for each row.
It seems that you have an extra level of SELECTs in both queries.
You can use EXPLAIN SELECT ... to get more clues on what is going on.
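One way to avoid both the HAVING pitfall and the duplicated subquery is a common table expression, sketched below. This is an assumption-laden illustration rather than a drop-in fix: it runs on Python's built-in sqlite3 module, CTEs need a reasonably recent server (MariaDB 10.2 or MySQL 8.0, as far as I know), the DATE_VALID filter is left out, and each exchange gets a single made-up row whose price times count equals the total from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listed_at "
    "(stock_exchange_code TEXT, share_price REAL, share_cnt INTEGER)"
)
conn.executemany(
    "INSERT INTO listed_at VALUES (?, ?, ?)",
    [
        ("BRX", 122653.50, 1),
        ("L&S", 275000.00, 1),
        ("MXK", 500000.00, 1),
        ("STU", 140415.00, 1),
        ("XETRA", 254610.00, 1),
    ],
)

# The CTE computes the per-exchange totals once; the scalar subquery then
# averages those totals and every row is compared with that single value.
above_average = conn.execute(
    """
    WITH totals AS (
        SELECT stock_exchange_code,
               SUM(share_price * share_cnt) AS summed
        FROM listed_at
        GROUP BY stock_exchange_code
    )
    SELECT stock_exchange_code, summed
    FROM totals
    WHERE summed > (SELECT AVG(summed) FROM totals)
    ORDER BY stock_exchange_code
    """
).fetchall()

print(above_average)
```

The two exchanges above the average are returned, matching the result of query (2).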
I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:
SELECT t.Result FROM
(SELECT Result, AVG(Result) avgr, STD(Result) stdr
FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement, as in:
(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75)
However this seems waay more messy than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?
Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result FROM
myTable AS t
CROSS JOIN (
SELECT AVG(Result) avgr, STD(Result) stdr
FROM myTable
WHERE myField = myCondition
) AS stats
WHERE
t.myField = myCondition
AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate, and also the move of the LIMIT clause to the outer query only.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
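The CROSS JOIN approach can be exercised with a small script. The sketch below uses Python's built-in sqlite3 module, which has no STD() function, so a population standard deviation aggregate is registered under that name to mimic MySQL's STD(). The data set is invented: 100 ordinary readings, one extreme reading, and one row belonging to a different myField value.

```python
import math
import sqlite3


class PopulationStd:
    """Population standard deviation, registered as STD (stand-in for MySQL's)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        mean = sum(self.values) / len(self.values)
        variance = sum((v - mean) ** 2 for v in self.values) / len(self.values)
        return math.sqrt(variance)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STD", 1, PopulationStd)
conn.execute("CREATE TABLE myTable (myField TEXT, Result REAL)")

readings = [("a", 9.0 + i % 3) for i in range(100)]  # values 9, 10, 11
readings += [("a", 1000.0), ("b", 5000.0)]           # outlier, other group
conn.executemany("INSERT INTO myTable VALUES (?, ?)", readings)

# Statistics are computed once over the 'a' rows, then joined to every row.
kept = [
    r[0]
    for r in conn.execute(
        """
        SELECT t.Result
        FROM myTable AS t
        CROSS JOIN (
            SELECT AVG(Result) AS avgr, STD(Result) AS stdr
            FROM myTable
            WHERE myField = ?
        ) AS stats
        WHERE t.myField = ?
          AND t.Result BETWEEN stats.avgr - 6 * stats.stdr
                           AND stats.avgr + 6 * stats.stdr
        """,
        ("a", "a"),
    )
]

print(len(kept), max(kept))
```

Note that with a population standard deviation a single value can be at most sqrt(n - 1) deviations from the mean, so a 6-sigma filter can only exclude anything once the set has more than 37 rows.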
I created a UDF that doesn't calculate exactly the way you asked (it discards a percent of the results from the top and bottom, instead of using std), but it might be useful for you (or someone else) anyway. It matches the Excel function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
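For readers who want to check the semantics outside the database, here is a plain Python sketch of a trimmed mean following the description above. It is not the UDF's implementation; in particular, rounding the number of trimmed points down to an even count is an assumption borrowed from how Excel documents TRIMMEAN.

```python
def trimmean(values, percent, decimals=4):
    """Average of values after trimming a fraction of points from both ends."""
    if not 0 <= percent < 1:
        raise ValueError("percent must be in the range [0, 1)")
    data = sorted(values)
    # Total points to drop, split evenly between bottom and top
    # (an odd total is rounded down to the next even number).
    per_side = int(len(data) * percent) // 2
    kept = data[per_side:len(data) - per_side]
    if not kept:
        raise ValueError("nothing left to average")
    return round(sum(kept) / len(kept), decimals)


# 20 points with percent = 0.2: 4 points are trimmed, 2 from each end.
result = trimmean(range(1, 21), 0.2)
print(result)

# A single extreme value is dropped together with the smallest one.
robust = trimmean([1, 2, 3, 4, 100], 0.4)
print(robust)
```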
All - I'm trying to do a basic (in theory) On-Time completion report. I'd like to list
assigned_to_id | Percent on-time (as a percent - but this is not important now)
I figure I tell MySQL get count of all tasks and a list of all tasks marked close on a date prior to the due date and give me that number... Seems simple?
I'm a sysadmin - not a SQL Developer so excuse the grossness to follow!
I've got
select issues.assigned_to_id, tb1.percent from (
Select
(select count(*) from issues where issues.due_date >= date(issues.closed_on) group by issues.assigned_to_id)/
(select count(*) from issues group by issues.assigned_to_id) as percent
from issues)
as tb1
group by tb1.percent;
It's been mixed up a bit with me trying to solve the multiple rows issue, so it may be even worse off than when I started - but if I could get a list of users with their percentage that would be great!
I'd love to use something like a "for each" but I know that doesn't exist.
Thanks!
You've got a division operation, e.g. (foo) / (bar), and both the numerator and denominator are subqueries. Since you're expecting to take those subqueries and divide their answers, they MUST return a SINGLE value each, e.g. 1 / 2.
The error message indicates that one (or probably both) is returning a multi-value query result, so in effect you're trying to do something like 1,2,3 / 4,5,6, which is not a valid math operation, and you end up with your error message.
Fix the subqueries so they return only a SINGLE value each.
SQL Server has CROSS/OUTER APPLY, which fits this case. MySQL itself does not support APPLY; from MySQL 8.0.14 on, a LATERAL derived table is the closest equivalent.
SELECT T.*,Data.Value FROM [Table] T OUTER APPLY
you can try to use that.
What you should probably be doing is using the IF() function inside an aggregate:
SELECT
assigned_to_id,
SUM(IF(due_date >= date(closed_on), 1, 0))/SUM(1) AS percent
FROM issues
GROUP BY assigned_to_id
ORDER BY percent DESC
Note here I am grouping by assigned_to_id and ordering by percent. This allows you to calculate the percentage for each assigned_to_id group and order those groups by percent.
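The conditional-aggregation idea can be checked on a few rows. This sketch runs on Python's built-in sqlite3 module rather than MySQL, so it uses a CASE expression instead of IF() and multiplies by 1.0 to avoid SQLite's integer division. The sample issues are invented; a still-open issue (closed_on is NULL) counts as not on time here, which is an assumption about the desired report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (assigned_to_id INTEGER, due_date TEXT, closed_on TEXT)"
)
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?)",
    [
        (1, "2020-01-10", "2020-01-05 09:00:00"),  # closed early: on time
        (1, "2020-01-10", "2020-01-12 17:30:00"),  # closed late
        (2, "2020-02-01", "2020-02-01 08:00:00"),  # closed on the due date
        (3, "2020-03-01", None),                   # still open
    ],
)

# One pass over the table: count the on-time rows and all rows per assignee.
report = conn.execute(
    """
    SELECT assigned_to_id,
           1.0 * SUM(CASE WHEN due_date >= date(closed_on) THEN 1 ELSE 0 END)
               / COUNT(*) AS percent
    FROM issues
    GROUP BY assigned_to_id
    ORDER BY percent DESC
    """
).fetchall()

for assigned_to_id, percent in report:
    print(assigned_to_id, percent)
```

Because the numerator and denominator come from the same grouped scan, no subquery can return more than one row.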
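If we have to rewrite your query as closely as possible, try this. The subqueries are correlated on assigned_to_id so that each returns a single value per user:
select i.assigned_to_id, ((select count(*) from issues d where d.assigned_to_id = i.assigned_to_id and d.due_date >= date(d.closed_on))/
(select count(*) from issues a where a.assigned_to_id = i.assigned_to_id)) as perc from issues i group by i.assigned_to_id;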
I think you want something like this, a list of assignee ids and the percentage done on time.
select distinct issues.assigned_to_id, done_on_time.c / issue_count.c as percent
from issues
join
(select issues.assigned_to_id, count(*) as c
from issues
where issues.due_date >= date(issues.closed_on)
group by issues.assigned_to_id) as done_on_time
on issues.assigned_to_id = done_on_time.assigned_to_id
join
(select issues.assigned_to_id, count(*) as c
from issues
group by issues.assigned_to_id) as issue_count
on issues.assigned_to_id = issue_count.assigned_to_id