I was doing an assignment where I had to convert SQL queries into relational algebra queries, and I got stuck converting the GROUP BY clause.
Could anyone please tell me how the GROUP BY clause can be written in relational algebra?
e.g.:
SELECT job, sal
FROM emp
GROUP BY job
;
Thanks!
Noting you want to get the sum of salary, in Tutorial D:
SUMMARIZE emp BY { job } ADD ( SUM ( sal ) AS total_sal )
Note aggregation is not a relational operator, hence will not form part of a relational algebra.
As for HAVING, it is a historical anomaly. Before the SQL-92 Standard, it was not possible to write SELECT expressions in the FROM clause (a.k.a. derived tables), i.e. you had to do all the work in one SELECT expression. Because of SQL's rigid evaluation order, the aggregate value doesn't come into existence until after the WHERE clause has been evaluated, i.e. it was impossible to apply a restriction based on aggregated values. HAVING was introduced to address this problem.
But even with HAVING, SQL remained relationally incomplete in Codd's sense until derived tables were introduced. Derived tables rendered HAVING redundant, but using HAVING is still popular (if Stack Overflow is anything to go by): folk still seem to like to use a single SELECT where possible, and SQL's aforementioned rigidity as regards evaluation order (projection is performed last in a SELECT expression) makes derived table usage quite verbose when compared to HAVING.
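To make the HAVING versus derived table point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The emp rows and the 2000 threshold are made up for illustration; the two queries are meant to show the same restriction on an aggregate written both ways.

```python
import sqlite3

# Hypothetical sample data: the rows below are assumptions, not from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (job TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("clerk", 800), ("clerk", 1100), ("analyst", 3000), ("manager", 2450)],
)

# Restriction on an aggregated value, written with HAVING in a single SELECT.
with_having = conn.execute(
    "SELECT job, SUM(sal) AS total_sal FROM emp "
    "GROUP BY job HAVING SUM(sal) > 2000 ORDER BY job"
).fetchall()

# The same restriction with a derived table and an ordinary WHERE clause.
with_derived = conn.execute(
    "SELECT job, total_sal FROM "
    "(SELECT job, SUM(sal) AS total_sal FROM emp GROUP BY job) AS t "
    "WHERE total_sal > 2000 ORDER BY job"
).fetchall()

print(with_having)
print(with_derived)
```

Both queries should return the same rows; the derived table version is just more verbose.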
First of all, your query is wrong: you cannot select a column that you did not group by unless you aggregate it. I assume you want to get the sum of sal.
$_{job}\mathcal{F}_{\mathrm{sum}(sal)}(emp)$
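If it helps to see what the grouping/aggregation operator actually computes, here is a rough sketch in Python: partition the rows by job, sum sal inside each partition, and compare with what SQL's GROUP BY returns. The sample rows are invented, and the corrected SQL (SUM(sal) instead of bare sal) is my assumption about what the question intended.

```python
import sqlite3
from collections import defaultdict

# Hypothetical emp rows as (job, sal) pairs.
rows = [("clerk", 800), ("clerk", 1100), ("analyst", 3000), ("manager", 2450)]

# Grouping by hand: one bucket per distinct job, summing sal within the bucket.
totals = defaultdict(int)
for job, sal in rows:
    totals[job] += sal
by_hand = sorted(totals.items())

# The same computation expressed as SQL GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (job TEXT, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
by_sql = conn.execute(
    "SELECT job, SUM(sal) AS total_sal FROM emp GROUP BY job ORDER BY job"
).fetchall()

print(by_hand)
print(by_sql)
```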
I have a question about aliases in SQL. If I want to use an alias elsewhere in the same query, can I? For example:
Consider a table named xyz with columns a and b:
select (a/b) as temp , temp/5 from xyz
Is this possible in some way?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server, which is where nearly all of my SQL experience lies. However, you can do the following:
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
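For anyone who wants to try the derived table workaround without a server, here is a small sketch using Python's sqlite3 module. The sample rows are invented, and I have named the alias ratio (and the second column ratio_fifth) purely for illustration in place of temp.

```python
import sqlite3

# Hypothetical data for the xyz table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.executemany("INSERT INTO xyz VALUES (?, ?)", [(10.0, 2.0), (50.0, 5.0)])

# The inner SELECT names the expression once; the outer SELECT can then
# refer to that name as if it were an ordinary column.
alias_reuse = conn.execute(
    "SELECT ratio, ratio / 5 AS ratio_fifth "
    "FROM (SELECT (a / b) AS ratio FROM xyz) AS T1 "
    "ORDER BY ratio"
).fetchall()

print(alias_reuse)
```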
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
You can use Oracle's WITH clause too. Similar constructs are available in other databases. Here is the form we use for Oracle:
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This can have a performance advantage, particularly if you have complex queries involving several nested queries, because the WITH subquery can be evaluated once and reused in the rest of the statement.
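As a rough illustration of the reuse, here is a sketch with Python's sqlite3 module (SQLite also supports WITH). The data and the names ratio and ratio_fifth are assumptions. It only shows that the named subquery can be referenced more than once in the same statement; whether the engine actually evaluates it once is up to the optimizer.

```python
import sqlite3

# Hypothetical data for the xyz table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.executemany("INSERT INTO xyz VALUES (?, ?)", [(10.0, 2.0), (50.0, 5.0)])

# The CTE t is defined once and referenced twice: in the main FROM clause
# and again in the scalar subquery that computes the average ratio.
above_average = conn.execute(
    "WITH t AS (SELECT a / b AS ratio FROM xyz) "
    "SELECT ratio, ratio / 5 AS ratio_fifth "
    "FROM t "
    "WHERE ratio >= (SELECT AVG(ratio) FROM t)"
).fetchall()

print(above_average)
```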
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per @Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression; bear in mind the person who will inherit your code).
Note that the code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression), but then again it is not a real SQL product.
It's possible, I guess:
SELECT (A/B) as temp, (temp/5)
FROM xyz,
(SELECT numerator_field as A, Denominator_field as B FROM xyz),
(SELECT (numerator_field/denominator_field) as temp FROM xyz);
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
Regarding using the Alias further in the query, depending on the database you are using it might be possible.
My requirement: I have a table, and I need to group by one of its fields and get the latest record in each group. I searched the Internet and found this approach:
SELECT *
FROM (
SELECT *
FROM record r
WHERE r.id IN (xx,xx,xx) HAVING 1
ORDER BY r.time DESC
) a
GROUP BY a.id
The result is correct, but I can't understand the meaning of "HAVING 1" after the WHERE clause. I hope someone can give me an answer. Thank you very much.
It does nothing, just like having true would. Presumably it is a placeholder where sometimes additional conditions are applied? But since there is no group by or use of aggregate functions in the subquery, any having conditions are going to be treated no differently than where conditions.
Normally you select rows and apply where conditions, then any grouping (explicit, or implicit as in select count(*)) occurs, and the having clause can specify further constraints after the grouping.
Note that your query is not guaranteed to give the results you want; the ORDER BY in the subquery in theory has no effect on the outer query, and the optimizer may skip it. It is possible the presence of HAVING makes a difference to the optimizer, but that is not something you should rely on, certainly not from one version of MySQL to another.
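One way to get the latest row per id without depending on the subquery's ORDER BY is to join the table against its own per-id maximum time. This is a different technique from the query in the question, sketched here with Python's sqlite3 module; the payload column and the sample rows are assumptions, and ties on time would return more than one row per id.

```python
import sqlite3

# Hypothetical record table; payload stands in for whatever other columns exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER, time TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "a"),
        (1, "2024-02-01", "b"),
        (2, "2024-01-15", "c"),
        (2, "2024-01-10", "d"),
    ],
)

# The derived table m holds the latest time for each id; joining back on
# (id, time) keeps only the rows that carry that latest time.
latest = conn.execute(
    "SELECT r.id, r.time, r.payload "
    "FROM record r "
    "JOIN (SELECT id, MAX(time) AS max_time FROM record GROUP BY id) m "
    "  ON m.id = r.id AND m.max_time = r.time "
    "ORDER BY r.id"
).fetchall()

print(latest)
```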
SELECT A.country
,A.market
,SUM(COALESCE(A.sales_value,0)) AS local_currency_sales
FROM
(
SELECT country
,market
,local_currency_sales
FROM TABLE
) A
GROUP BY A.country
,A.market
The above is the pseudocode I am referring to. I am new to SQL and would like to ask whether there is any need for a nested SELECT like the one above. I tried removing the nested SELECT and it threw an error: country must appear in the GROUP BY clause. Intuitively, I would expect the query below to work even with GROUP BY:
SELECT A.country
,A.market
,SUM(COALESCE(A.sales_value,0)) AS local_currency_sales
FROM TABLE A
GROUP BY A.country
,A.market
Your original post does not need a subquery at all. In fact, the first query will not execute, because it tries to perform a SUM over a column (sales_value) that the inner query does not return. The following adaptation might help explain how this works:
SELECT A.CName
,A.M
,SUM(A.Sales) AS local_currency_sales
FROM
(
SELECT country as CName
,market as M
,COALESCE(sales_value,0) as Sales
FROM TABLE
) A
GROUP BY A.CName
,A.M
The use of Sub-Queries is usually to make the query easier to read or maintain. In this example you can see that the nested query evaluates the COALESCE function and has aliased the column names.
It is important to note that the outer query can only access the columns that are returned from the inner query AND that they can only be accessed by aliases that have been assigned.
In advanced scenarios you might use nested queries to manually force query optimisations, however most database engines will have very good query optimisations by default, so it is important to recognise that nested queries can easily get in the way of standard optimisations and if done poorly may negatively affect performance.
There are of course some types of expressions that cannot be used in ORDER BY and GROUP BY clauses. When you come across these conditions it might be necessary to nest the query so that you can group or sort by the results of those evaluations.
Window functions are a prime example of this requirement.
The specific queries where this becomes important are different for each RDBMS, and they each have their own workarounds or alternate implementations that may be more efficient than using a sub-query at all. The specifics are outside the scope of this post.
The simple form of your query that you have posted is all you need in this case:
SELECT A.country
,A.market
,SUM(COALESCE(A.sales_value,0)) AS local_currency_sales
FROM TABLE A
GROUP BY A.country
,A.market
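To check that the nested and flat forms agree, here is a small sketch using Python's sqlite3 module. The table name sales stands in for the TABLE placeholder, and the rows are made up (including one NULL so that COALESCE has something to do).

```python
import sqlite3

# Hypothetical table and rows; "sales" replaces the TABLE placeholder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, market TEXT, sales_value INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("US", "retail", 100), ("US", "retail", None),
     ("US", "online", 50), ("DE", "retail", 70)],
)

# Flat form: group the base table directly.
flat = conn.execute(
    "SELECT A.country, A.market, "
    "       SUM(COALESCE(A.sales_value, 0)) AS local_currency_sales "
    "FROM sales A "
    "GROUP BY A.country, A.market "
    "ORDER BY A.country, A.market"
).fetchall()

# Nested form: the inner query must return every column the outer query uses.
nested = conn.execute(
    "SELECT A.country, A.market, SUM(A.sales) AS local_currency_sales "
    "FROM (SELECT country, market, COALESCE(sales_value, 0) AS sales "
    "      FROM sales) A "
    "GROUP BY A.country, A.market "
    "ORDER BY A.country, A.market"
).fetchall()

print(flat)
print(nested)
```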
I came up with this solution in my class by piecing together Internet knowledge. Please break it down for me; I would love to know how I made it work, specifically the t.s and the closing t.
SELECT
CourseType,
GPA,
NumberOfStudents * 100 / t.s AS `Percentage of Students`
FROM View1
CROSS JOIN
(
SELECT
SUM(NumberOfStudents) AS s
FROM View1) t;
Your query uses a subquery. A sub-query is a query that is done within another query. In your case, your subquery is:
(
SELECT
SUM(NumberOfStudents) AS s
FROM View1)
When you use a subquery in the FROM clause, you need to give it an alias. An alias is just a name you give a subquery, so you can use it in the main query.
In your example, you named your subquery "t".
Fields can also have aliases. In your subquery, you created a field SUM(NumberOfStudents), and you named it s.
Going back to your question, you use the aliases to address fields inside the subquery. In your case, when you write NumberOfStudents * 100 / t.s you are basically saying:
"I want to divide NumberOfStudents * 100 by the field s from my subquery t".
The other concept that is important in your query is the Cross join. A cross join is the Cartesian product of two tables.
You can find a great and intuitive explanation of how a cross join works in the following link:
https://www.sqlshack.com/sql-cross-join-with-examples/#:~:text=The%20CROSS%20JOIN%20is%20used,also%20known%20as%20cartesian%20join.&text=The%20main%20idea%20of%20the,product%20of%20the%20joined%20tables.
In this case, the use is simpler than that. Your subquery returns only one row with one value, which is the sum of all students. And since a cross join pairs every row of one table with every row of the other, your cross join just provides a way to use the total number of students as a constant value when calculating the percentage of students in the main query.
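Here is a small sketch of that idea using Python's sqlite3 module. I have created View1 as a plain table with invented rows and renamed the result column to pct; the point is only to show the one-row subquery being attached to every row.

```python
import sqlite3

# Hypothetical stand-in for View1, created as a table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE View1 (CourseType TEXT, GPA REAL, NumberOfStudents INTEGER)")
conn.executemany(
    "INSERT INTO View1 VALUES (?, ?, ?)",
    [("Lecture", 3.1, 10), ("Lab", 3.4, 30), ("Seminar", 3.7, 60)],
)

# The subquery on its own: exactly one row holding the grand total.
total = conn.execute("SELECT SUM(NumberOfStudents) AS s FROM View1").fetchall()

# Cross joining that single row attaches the same total (t.s) to every row.
percentages = conn.execute(
    "SELECT CourseType, GPA, NumberOfStudents * 100 / t.s AS pct "
    "FROM View1 "
    "CROSS JOIN (SELECT SUM(NumberOfStudents) AS s FROM View1) t "
    "ORDER BY CourseType"
).fetchall()

print(total)
print(percentages)
```

Note that with integer columns the division is integer division, so the sample counts were chosen to divide evenly.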
A better way to do this uses window functions:
SELECT v.CourseType, v.GPA,
v.NumberOfStudents * 100 / SUM(v.NumberOfStudents) OVER () AS Percentage_of_Students
FROM View1 v;
If you are learning SQL, you might as well learn the correct way to express logic.
Notes:
Use meaningful table aliases (abbreviations for the table/view names).
Qualify column references. This is less important in a query with only one table reference, but it is a good habit.
Window functions allow you to summarize data across multiple rows, without using an explicit JOIN.
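A quick way to convince yourself that the window function form matches the cross join form is to run both side by side. This sketch uses Python's sqlite3 module and assumes the bundled SQLite is 3.25 or newer (the first version with window functions); View1 is faked as a table with invented rows, and pct is just an illustrative column name.

```python
import sqlite3

# Hypothetical stand-in for View1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE View1 (CourseType TEXT, GPA REAL, NumberOfStudents INTEGER)")
conn.executemany(
    "INSERT INTO View1 VALUES (?, ?, ?)",
    [("Lecture", 3.1, 10), ("Lab", 3.4, 30), ("Seminar", 3.7, 60)],
)

# SUM(...) OVER () computes the total across all rows without collapsing them.
windowed = conn.execute(
    "SELECT v.CourseType, v.GPA, "
    "       v.NumberOfStudents * 100 / SUM(v.NumberOfStudents) OVER () AS pct "
    "FROM View1 v "
    "ORDER BY v.CourseType"
).fetchall()

# The cross join form, for comparison.
cross_joined = conn.execute(
    "SELECT v.CourseType, v.GPA, v.NumberOfStudents * 100 / t.s AS pct "
    "FROM View1 v "
    "CROSS JOIN (SELECT SUM(NumberOfStudents) AS s FROM View1) t "
    "ORDER BY v.CourseType"
).fetchall()

print(windowed)
print(cross_joined)
```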
Why would someone use a group by versus distinct when there are no aggregations done in the query?
Also, does someone know the GROUP BY versus DISTINCT performance considerations in MySQL and SQL Server? I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to DISTINCT.
I'm interested in dba answers.
EDIT:
Bill's post is interesting, but not applicable. Let me be more specific...
select a, b, c
from table x
group by a, b,c
versus
select distinct a,b,c
from table x
GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.
SELECT b, c, d FROM table1 GROUP BY a;
This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.
However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.
Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:
SELECT DISTINCT(a), b, c FROM table1;
Despite the parentheses making DISTINCT look like a function call, it is not. It's a query option, and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.
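The misconception is easy to test. In this sketch (Python's sqlite3 module, with made-up rows for table1) column a has only one distinct value, yet the query with DISTINCT(a) still returns three rows, the same as plain DISTINCT a, b, c.

```python
import sqlite3

# Hypothetical rows: a is always 1, but (a, b, c) has three distinct combinations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 1), (1, 2, 2)],
)

# The parentheses only wrap the expression a; DISTINCT still covers all columns.
with_parens = conn.execute(
    "SELECT DISTINCT(a), b, c FROM table1 ORDER BY a, b, c"
).fetchall()

without_parens = conn.execute(
    "SELECT DISTINCT a, b, c FROM table1 ORDER BY a, b, c"
).fetchall()

print(with_parens)
print(without_parens)
```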
A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.
For the pattern:
SELECT col1, col2 FROM table GROUP BY col1, col2
and
SELECT DISTINCT col1, col2 FROM table
When there's no covering index for the query, both ways produced the following query plan:
|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
|--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))
and when there was a covering index, both produced:
|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
|--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)
so from that very small sample SQL Server certainly treats both the same.
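Independently of the plans, the two patterns return the same set of rows. A minimal sketch with Python's sqlite3 module (hypothetical table t and rows; it says nothing about SQL Server's or MySQL's performance):

```python
import sqlite3

# Hypothetical table with some duplicate (col1, col2) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (2, "z"), (2, "z")],
)

grouped = conn.execute("SELECT col1, col2 FROM t GROUP BY col1, col2").fetchall()
distinct = conn.execute("SELECT DISTINCT col1, col2 FROM t").fetchall()

# Neither query promises an ordering, so sort before comparing.
grouped_sorted = sorted(grouped)
distinct_sorted = sorted(distinct)

print(grouped_sorted)
print(distinct_sorted)
```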
In MySQL I've found using a GROUP BY is often better in performance than DISTINCT.
Doing an "EXPLAIN SELECT DISTINCT" shows "Using where; Using temporary", i.e. MySQL will create a temporary table,
whereas an "EXPLAIN SELECT a, b, c FROM T1, T2 WHERE T2.A = T1.A GROUP BY a" just shows "Using where".
Both would generate the same query plan in MS SQL Server. If you have MS SQL Server, you could just enable the actual execution plan to see which one is better for your needs.
Please have a look at those posts:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html
If you really are looking for distinct values, DISTINCT makes the source code more readable (for example, if it's part of a stored procedure). If I'm writing ad-hoc queries I'll usually start with GROUP BY, even if I have no aggregations, because I'll often end up adding them.