Spark SQL vs Normal SQL query error without using first() - mysql

I am trying to run a simple query in Spark SQL, but it throws an error unless I use first().
This query works normally with MySQL:
SELECT film.title,count(rental.rental_id) as total_rentals, film.rental_rate, count(rental.rental_id) * film.rental_rate as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
But the same query does not work with Spark SQL.
The error I am getting is :
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Doing this actually fixes the problem
SELECT film.title,count(rental.rental_id) as total_rentals, first(film.rental_rate), count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
Can someone explain why this is required in Spark SQL?

There is a common requirement in SQL that all non-aggregated columns in a group by query must appear in the group by clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the group by clause.
I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running this in MySQL, which (alas!) has options that allow disabling the standard requirement.
In a database that supports functional dependency in group by, you would phrase the query as:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1
I don't think Spark would understand that, so just add all the needed columns to the group by clause:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
Notes:
having film_id in the group by clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together
count(r.rental_id) can be simplified as count(*) (since obviously that column cannot be null)
table aliases make the queries easier to write and read
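The first note can be checked on a toy dataset. The following is a minimal sketch, assuming an in-memory SQLite database stands in for MySQL or Spark; the three tables are cut down to the columns used here and every row is invented for illustration. Two different films deliberately share the same title and rate.

```python
import sqlite3

# Hypothetical miniature of the film/inventory/rental tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
    -- Two different films share the title 'HEAT' and the same rate.
    INSERT INTO film VALUES (1, 'HEAT', 2.5), (2, 'HEAT', 2.5), (3, 'ALIEN', 4.0);
    INSERT INTO inventory VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO rental VALUES (100, 10), (101, 10), (102, 20), (103, 30);
""")

JOINS = """
    FROM rental r
    INNER JOIN inventory i ON r.inventory_id = i.inventory_id
    INNER JOIN film f ON i.film_id = f.film_id
"""

# Without film_id in the GROUP BY, the two 'HEAT' films collapse into one row.
by_title = conn.execute(
    "SELECT f.title, COUNT(*)" + JOINS +
    "GROUP BY f.title, f.rental_rate ORDER BY 1"
).fetchall()

# With film_id in the GROUP BY, each film keeps its own row.
by_film = conn.execute(
    "SELECT f.film_id, f.title, COUNT(*)" + JOINS +
    "GROUP BY f.film_id, f.title, f.rental_rate ORDER BY 1"
).fetchall()

print(by_title)
print(by_film)
```

Grouping by title and rate alone reports a single 'HEAT' row with three rentals, while including film_id splits it into two rentals for one film and one for the other.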

I suspect that you want:
SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,
SUM(f.rental_rate) as revenue
FROM rental r JOIN
inventory i
ON r.inventory_id = i.inventory_id JOIN
film f
ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;
Notes:
In general, the GROUP BY columns should be the unaggregated columns in the SELECT. MySQL itself has required this with its default settings since version 5.7.5, when ONLY_FULL_GROUP_BY became part of the default sql_mode.
You can just sum the rental_rate column. There is no need to count and multiply.
Table aliases make the query easier to write and to read.
That the first SQL works in MySQL is because MySQL extends the SQL syntax to allow it. SparkSQL (in this case) is doing what just about every other database does.
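To see that summing the rate gives the same revenue as counting and multiplying, here is a small sketch, again assuming SQLite as a stand-in and using invented rows. The rates are chosen to be exactly representable as floats so the two columns can be compared directly.

```python
import sqlite3

# Hypothetical toy data; table and column names follow the question, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
    INSERT INTO film VALUES (1, 'ALIEN', 4.0), (2, 'HEAT', 2.5);
    INSERT INTO inventory VALUES (10, 1), (20, 2), (21, 2);
    INSERT INTO rental VALUES (100, 10), (101, 20), (102, 21), (103, 21);
""")

# Every non-aggregated SELECT column appears in the GROUP BY,
# so the same statement is also acceptable to stricter engines.
rows = conn.execute("""
    SELECT f.title,
           COUNT(*)                 AS total_rentals,
           f.rental_rate,
           SUM(f.rental_rate)       AS revenue_sum,
           COUNT(*) * f.rental_rate AS revenue_product
    FROM rental r
    JOIN inventory i ON r.inventory_id = i.inventory_id
    JOIN film f ON i.film_id = f.film_id
    GROUP BY f.title, f.rental_rate
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
```

Each joined row carries the film's rate once per rental, which is why the sum and the product agree within a group.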

Related

All my derived tables have aliases! Why am I getting 'Error 1248: Every derived table must have its own alias'?

Every time I run this query, I get:
ERROR 1248 (42000): Every derived table must have its own alias
As you can see, we have a parent query which left joins against a derived table created by a subquery.
This subquery in turn selects from a second derived table, and inner joins a third derived table.
All three derived tables have proper aliases (n1, n2 and subquery).
The subquery executes as expected when I execute it independently. The issue only occurs when I wrap it in the parent query.
Query:
SELECT DATE_FORMAT(p.date_admitted, '%Y-%m') as month_admitted,
diagnosis as diagnosis,
education as education,
COUNT(*) as total
FROM patient_discharge_form d
INNER JOIN survey_data sd ON sd.id = d.id
LEFT JOIN submission s ON s.id = sd.submission_id
LEFT JOIN patient p ON p.id = s.patient_id
LEFT JOIN (
SELECT n1.id,n1.diagnosis,n2.education
FROM (
SELECT id,'Gest Hyp' as diagnosis FROM patient_discharge_form WHERE gestational_hypertension=1
UNION ALL
SELECT id,'Pre w/ Sev' FROM patient_discharge_form WHERE preeclampsia_non_severe=1
) n1
INNER JOIN (
SELECT id,'Written' as education FROM patient_discharge_form WHERE education LIKE '%written%'
UNION ALL
SELECT id,'Verbal' FROM patient_discharge_form WHERE education LIKE '%verbal%'
) n2 ON n1.id=n2.id
) subquery ON d.id = subquery.id
WHERE (s.status = 'complete')
GROUP BY month_admitted, diagnosis, education
Hmmm . . . I don't see a problem with the table aliases in the query. I wonder if this is exactly the query you are running.
However, I do see a problem with the column aliases. The education column in the select (and perhaps diagnosis as well) is ambiguous. It could come from either d or subquery and perhaps other tables as well.
In general, you should qualify all column names in a query to avoid problems.
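The ambiguity can be reproduced in miniature. This is only a sketch under assumptions: SQLite is used instead of MySQL, so the error text differs from what MySQL prints, and the table is reduced to two invented rows. It does not reproduce error 1248 itself, only the unqualified-column problem described above.

```python
import sqlite3

# Hypothetical cut-down version of patient_discharge_form; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient_discharge_form (id INTEGER PRIMARY KEY, education TEXT);
    INSERT INTO patient_discharge_form VALUES (1, 'written handout'), (2, 'verbal only');
""")

# Both d and the derived table expose a column named education.
derived = """
    FROM patient_discharge_form d
    LEFT JOIN (
        SELECT pdf.id, 'Written' AS education
        FROM patient_discharge_form pdf
        WHERE pdf.education LIKE '%written%'
    ) subquery ON d.id = subquery.id
"""

# Unqualified: the engine cannot tell which education column is meant.
try:
    conn.execute("SELECT education" + derived)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

# Qualified: the same query is accepted.
qualified = conn.execute(
    "SELECT d.id, subquery.education" + derived + "ORDER BY d.id"
).fetchall()

print(error)
print(qualified)
```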

MySQL: Why was my answer incorrect? Which film was shown in the Chaplin room most often in October 2017?

I just finished taking the MySQL course as a beginner and struggled a bit with the join and sub-queries statements.
For the question, why is my answer incorrect?:
MY RESPONSE:
SELECT f.name, r.name, COUNT(s.room_id) AS film_times FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.name = 'Chaplin';
SOLUTION:
SELECT f.name, r.name, COUNT(r.name) AS film FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
GROUP BY f.name
ORDER BY film DESC
LIMIT 1;
The most notable problem with your query is that it selects non-aggregated columns (f.name, r.name) alongside an aggregate function (COUNT()) without a GROUP BY clause. This is invalid SQL. Basically, if you want a count per film, you need to specify a grouping criterion.
Also, you are missing the ORDER BY and LIMIT 1, which let you select the film with the most occurrences (that is, the group that contains the most rows).
The SELECT and FROM clauses look fine - the solution filters rooms by id while you filter by name, but both should be OK (as long as there are no duplicate room names).
Finally, let me point out that the solution is itself somewhat flawed: not all non-aggregated columns appear in the GROUP BY clause, although this is a common SQL requirement (old versions of MySQL are lax about it). Furthermore, it groups by film name, which (again) opens up the possibility of issues if two different films have the same name. This would be better phrased as:
SELECT f.name, r.name, COUNT(*) AS no_occurences FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
GROUP BY f.id, f.name, r.name
ORDER BY no_occurences DESC
LIMIT 1;
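The difference between a bare aggregate and a grouped one can be tried on a few rows. The sketch below assumes SQLite in place of MySQL and a hypothetical three-table schema with only the columns used in the queries; the film and room rows are invented, and no date filter is applied because neither query in the thread has one.

```python
import sqlite3

# Hypothetical films/rooms/screenings tables; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE films (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE rooms (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE screenings (id INTEGER PRIMARY KEY, film_id INTEGER, room_id INTEGER);
    INSERT INTO films VALUES (1, 'Film A'), (2, 'Film B');
    INSERT INTO rooms VALUES (1, 'Chaplin'), (2, 'Kubrick');
    INSERT INTO screenings VALUES (1, 1, 1), (2, 2, 1), (3, 2, 1), (4, 1, 2), (5, 1, 2);
""")

JOINS = """
    FROM films f
    JOIN screenings s ON f.id = s.film_id
    JOIN rooms r ON s.room_id = r.id
"""

# An aggregate without GROUP BY collapses everything into a single row:
# it counts all screenings in the room, not screenings per film.
ungrouped = conn.execute(
    "SELECT COUNT(*)" + JOINS + "WHERE r.name = 'Chaplin'"
).fetchall()

# Grouping per film, then sorting and limiting, picks the most-shown film.
top = conn.execute(
    "SELECT f.name, r.name, COUNT(*) AS no_occurrences" + JOINS +
    "WHERE r.name = 'Chaplin' "
    "GROUP BY f.id, f.name, r.name "
    "ORDER BY no_occurrences DESC LIMIT 1"
).fetchall()

print(ungrouped)
print(top)
```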

Queries SQL - Sakila BD

I'm having trouble creating some queries.
I'm using Sakila DB. I am trying to create a new column with the number of delays per client, using "count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as n"...
The question is: which are the top 10 customers with the most delays in returning movies?
Select customer.first_name, customer.last_name, count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as nTime
From customer,film,rental,inventory
Where customer.customer_id=rental.customer_id
and rental.inventory_id=inventory.inventory_id
and (datediff (rental.rental_date,rental.return_date)) > film.rental_duration
limit 10;
What am I doing wrong?
Thanks!!
I made a few assumptions, but this should be quite close to what you want:
select c.customer_id, count(*)
from
customer c
inner join rental r
on r.customer_id = c.customer_id
inner join film f
on f.film_id = r.film_id
and datediff(r.return_date, r.rental_date) > f.rental_duration
group by c.customer_id
order by count(*) desc
limit 10;
Problems with your query:
it is missing aggregation; you need to group records by customers, so you can compute how many late rental returns happened per customer
it is missing a join condition for the film table; I assumed that film relates to rental through column film_id (in the stock Sakila schema, rental has no film_id column, so the link would instead go through inventory.film_id)
the previous issue would have been much easier to spot if you were using standard, explicit joins instead of old-school, implicit joins; this is one of the many reasons why you should always use standard joins
as commented by Thorsten Kettner, the inventory table seems superfluous in this query under that assumption: the 3 other tables would then contain all the information you need
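For a runnable illustration of the per-customer aggregation, here is a sketch under several assumptions: SQLite stands in for MySQL, so julianday() differences replace DATEDIFF (which in MySQL returns its first argument minus its second); the join path goes through inventory, as in the stock Sakila schema; and all rows are invented. A rental that was never returned has a NULL return_date and is not counted as late here.

```python
import sqlite3

# Hypothetical miniature of four Sakila tables; only the columns used below, invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, rental_duration INTEGER);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (
        rental_id INTEGER PRIMARY KEY, customer_id INTEGER, inventory_id INTEGER,
        rental_date TEXT, return_date TEXT
    );
    INSERT INTO customer VALUES (1, 'ANN'), (2, 'BOB');
    INSERT INTO film VALUES (1, 3), (2, 5);
    INSERT INTO inventory VALUES (10, 1), (20, 2);
    INSERT INTO rental VALUES
        (100, 1, 10, '2005-05-01', '2005-05-06'),  -- 5 days, allowed 3: late
        (101, 1, 20, '2005-05-01', '2005-05-04'),  -- 3 days, allowed 5: on time
        (102, 2, 10, '2005-06-01', '2005-06-10'),  -- 9 days, allowed 3: late
        (103, 2, 20, '2005-06-01', '2005-06-08'),  -- 7 days, allowed 5: late
        (104, 1, 10, '2005-07-01', NULL);          -- not returned yet: skipped
""")

# Days kept = return date minus rental date; group per customer and rank.
late = conn.execute("""
    SELECT c.customer_id, COUNT(*) AS late_returns
    FROM customer c
    INNER JOIN rental r ON r.customer_id = c.customer_id
    INNER JOIN inventory i ON i.inventory_id = r.inventory_id
    INNER JOIN film f ON f.film_id = i.film_id
    WHERE julianday(r.return_date) - julianday(r.rental_date) > f.rental_duration
    GROUP BY c.customer_id
    ORDER BY late_returns DESC, c.customer_id
    LIMIT 10
""").fetchall()

print(late)
```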

Very slow sql query for count

I need to get a report count for each user role, but my SQL query is very slow (40 seconds on a good server). My SQL query:
SELECT `auth_assignment`.`item_name`, COUNT(*) as count
FROM `report`
LEFT JOIN `company` ON company.id = report.company_id
LEFT JOIN `auth_assignment`
ON auth_assignment.user_id = company.user_id
GROUP BY `auth_assignment`.`item_name`
ORDER BY `count`
auth_assignment.item_name is the role type.
auth_assignment has ~23k rows.
company has ~11k rows.
report has ~12k rows (one company can have many reports).
report.id and company.id have a binding.
First, you are aggregating on a column from the third table in a left join. I'm guessing you don't want NULL for the value, so use inner join or change the order of the tables.
Table aliases make the query easier to write and to read:
SELECT aa.item_name, COUNT(*) as cnt
FROM report r JOIN
company c
ON c.id = r.company_id JOIN
auth_assignment aa
ON aa.user_id = c.user_id
GROUP BY aa.item_name
ORDER BY cnt;
Assuming the joins are correct for the tables, you just want to be sure that you have indexes. These should go on the columns used for the joins: company(id, user_id), auth_assignment(user_id, item_name).
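Both points can be sketched on toy data: the NULL group produced by the LEFT JOIN, and the two suggested indexes. This assumes SQLite rather than MySQL; the index names are hypothetical, the rows are invented, and the sketch says nothing about actual timings on the real tables.

```python
import sqlite3

# Hypothetical miniature of the three tables; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE report (id INTEGER PRIMARY KEY, company_id INTEGER);
    CREATE TABLE auth_assignment (user_id INTEGER, item_name TEXT);
    INSERT INTO company VALUES (1, 100), (2, 200);
    INSERT INTO auth_assignment VALUES (100, 'admin');  -- user 200 has no role
    INSERT INTO report VALUES (1, 1), (2, 1), (3, 2);
    -- Indexes on the join columns, as suggested above (names are made up).
    CREATE INDEX idx_company_id_user ON company (id, user_id);
    CREATE INDEX idx_aa_user_item ON auth_assignment (user_id, item_name);
""")

QUERY = """
    SELECT aa.item_name, COUNT(*) AS cnt
    FROM report r
    {join} company c ON c.id = r.company_id
    {join} auth_assignment aa ON aa.user_id = c.user_id
    GROUP BY aa.item_name
    ORDER BY cnt
"""

# LEFT JOIN keeps reports whose user has no role, grouped under NULL.
left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()

# INNER JOIN drops them, so only real roles are counted.
inner_rows = conn.execute(QUERY.format(join="JOIN")).fetchall()

print(left_rows)
print(inner_rows)
```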

How can I combine these two SQL queries into a single query?

I am using the Sakila database within MySql 5.6.25.
The first query gives me the films which provide our "store" with the highest revenue, in descending order. I also show the number of times each film has been rented and its rental rate:
SELECT f.title, f.rental_rate, count(r.rental_id) AS "Times Rented", count(r.rental_id) * f.rental_rate as Revenue
from film f
INNER JOIN inventory i
ON f.film_id = i.film_id
INNER JOIN rental r
ON r.inventory_id = i.inventory_id
GROUP BY f.title
ORDER BY revenue DESC
The second query is showing us how many copies of the film we have on hand:
SELECT film.title, count(inventory.film_id)
from film
INNER JOIN inventory
ON film.film_id = inventory.film_id
group by film.title
I understand how both queries are working... independently... but when I try to combine them, they produce unexpected results. Please show me the correct way to combine them without changing the manner in which results are being shown.
For those unaware, MySQL comes with a great practice database called sakila on which to practice queries.
correct approach (or something like it)
SELECT f.title,
f.rental_rate,
count(r.rental_id) AS "Times Rented",
count(r.rental_id) * f.rental_rate as Revenue,
(select count(*) from inventory where film_id=f.film_id) as InvCount
from film f
INNER JOIN inventory i
ON f.film_id = i.film_id
INNER JOIN rental r
ON r.inventory_id = i.inventory_id
GROUP BY f.title
ORDER BY revenue DESC
This will give you InvCount=8 for first row (film BUCKET BROTHERHOOD).
wrong approach
count(i.film_id) as InvCount
This is wrong because, in the whole query, alias i is multiplied by its join to alias rental r: it yields one row per rental rather than one row per inventory copy.
So for the first row of output, BUCKET BROTHERHOOD, with 34 rentals and 8 actual inventory of the item ....
If you do it the wrong way it shows InvCount=34. The correct answer is 8.
Always do a sanity check with the data to keep the boss from yelling at you.
select film_id,title from film where title like 'bucket br%'; -- film_id=103
select count(*) from inventory where film_id=103; -- count=8
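The same sanity check can be scripted. The sketch below assumes SQLite in place of MySQL and a scaled-down, invented dataset (3 copies and 5 rentals instead of 8 and 34), with one copy that has never been rented. It compares the joined count, the correlated subquery, and COUNT(DISTINCT ...), which is a third option not taken from the answer above.

```python
import sqlite3

# Hypothetical scaled-down data: 3 copies of film 103, 5 rentals, copy 3 never rented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
    INSERT INTO film VALUES (103, 'BUCKET BROTHERHOOD', 4.0);
    INSERT INTO inventory VALUES (1, 103), (2, 103), (3, 103);
    INSERT INTO rental VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
""")

row = conn.execute("""
    SELECT f.title,
           COUNT(r.rental_id) AS times_rented,
           -- Counts joined rows, so it repeats each copy once per rental.
           COUNT(i.film_id) AS inv_count_joined,
           -- Counts copies directly, independent of the rental join.
           (SELECT COUNT(*) FROM inventory WHERE film_id = f.film_id) AS inv_count,
           -- Counts only copies that appear in at least one rental.
           COUNT(DISTINCT i.inventory_id) AS rented_copies
    FROM film f
    INNER JOIN inventory i ON f.film_id = i.film_id
    INNER JOIN rental r ON r.inventory_id = i.inventory_id
    GROUP BY f.film_id, f.title
""").fetchone()

print(row)
```

On this data the joined count equals the number of rentals, the subquery reports all three copies, and the distinct count misses the copy that was never rented, which is one reason the subquery is the safer choice here.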