I am using the Sakila database with MySQL 5.6.25.
The first query lists the films that bring our "store" the most revenue, in descending order. It also shows how many times each film has been rented and its rental rate:
SELECT f.title, f.rental_rate, count(r.rental_id) AS "Times Rented", count(r.rental_id) * f.rental_rate as Revenue
from film f
INNER JOIN inventory i
ON f.film_id = i.film_id
INNER JOIN rental r
ON r.inventory_id = i.inventory_id
GROUP BY f.title
ORDER BY revenue DESC
The second query is showing us how many copies of the film we have on hand:
SELECT film.title, count(inventory.film_id)
from film
INNER JOIN inventory
ON film.film_id = inventory.film_id
group by film.title
I understand how both queries are working... independently... but when I try to combine them, they produce unexpected results. Please show me the correct way to combine them without changing the manner in which results are being shown.
For those unaware, MySQL comes with a great practice database called sakila on which to practice queries.
correct approach (or something like it)
SELECT f.title,
f.rental_rate,
count(r.rental_id) AS "Times Rented",
count(r.rental_id) * f.rental_rate as Revenue,
(select count(*) from inventory where film_id=f.film_id) as InvCount
from film f
INNER JOIN inventory i
ON f.film_id = i.film_id
INNER JOIN rental r
ON r.inventory_id = i.inventory_id
GROUP BY f.title
ORDER BY revenue DESC
This will give you InvCount=8 for first row (film BUCKET BROTHERHOOD).
wrong approach
count(i.film_id) as InvCount
is wrong because the join to rental r repeats each inventory row i once per rental, so counting i.film_id counts rentals rather than copies.
So for the first row of output, BUCKET BROTHERHOOD, there are 34 rentals but only 8 copies of the item in inventory.
If you do it the wrong way it shows InvCount=34. The correct answer is 8.
Always do a sanity check with the data to keep the boss from yelling at you.
select film_id,title from film where title like 'bucket br%'; -- film_id=103
select count(*) from inventory where film_id=103; -- count=8
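To see the row multiplication on something small, here is a minimal sketch using Python's built-in sqlite3 module instead of MySQL. The tables are a stripped-down stand-in for Sakila and the data (one film, 2 copies, 3 rentals) is made up. It compares the wrong count, the correlated subquery from above, and COUNT(DISTINCT ...) as another option.

```python
import sqlite3

# Toy stand-in for Sakila: one film, 2 copies, 3 rentals (made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
INSERT INTO film VALUES (1, 'DEMO FILM', 5.0);
INSERT INTO inventory VALUES (10, 1), (11, 1);
INSERT INTO rental VALUES (100, 10), (101, 10), (102, 11);
""")

row = conn.execute("""
SELECT f.title,
       COUNT(r.rental_id)             AS times_rented,
       COUNT(i.film_id)               AS wrong_inv_count,     -- one per joined rental row
       COUNT(DISTINCT i.inventory_id) AS distinct_inv_count,  -- copies rented at least once
       (SELECT COUNT(*) FROM inventory
         WHERE film_id = f.film_id)   AS subquery_inv_count   -- all copies on hand
FROM film f
INNER JOIN inventory i ON f.film_id = i.film_id
INNER JOIN rental r ON r.inventory_id = i.inventory_id
GROUP BY f.film_id, f.title
""").fetchone()

print(row)  # ('DEMO FILM', 3, 3, 2, 2)
```

Note that COUNT(DISTINCT i.inventory_id) only sees copies that survive the inner join to rental, so a copy that was never rented would be missed. The correlated subquery counts every copy regardless.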
I need to: Find the movies whose total number of actors is above the average. Return the movie names and its number of actors ordered by the title. IMPORTANT NOTE: this query should return many movies. Show only the top 10 results.
I am using MySQL Workbench for this, and running the query against the Sakila database.
[picture attached]
My code so far is:
SELECT film.title, COUNT(film_actor.actor_id) AS actor_count
FROM film_actor
INNER JOIN film ON film_actor.film_id=film.film_id
GROUP BY film.film_id
HAVING COUNT(film_actor.actor_id) >
AVG(film_actor.actor_id)
ORDER BY film.title
with cte as(
SELECT film.film_id,film.title, COUNT(film_actor.actor_id) AS actor_count
FROM film_actor
INNER JOIN film ON film_actor.film_id=film.film_id
GROUP BY film.film_id,film.title
)
select cte.title,cte.actor_count from
cte where cte.actor_count>(select avg(actor_count) from cte)
order by cte.title limit 0,10
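The HAVING attempt fails because AVG(film_actor.actor_id) averages the actor IDs inside each group, not the per-film actor counts. The CTE computes the counts first and then averages them. A minimal sketch of that idea on made-up data, run through Python's sqlite3 module (not MySQL); the four films and their casts are invented for illustration. In MySQL, WITH needs version 8.0 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
INSERT INTO film VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
-- actor counts per film: A=3, B=1, C=2, D=4, so the average is 2.5
INSERT INTO film_actor VALUES
  (1, 1), (2, 1), (3, 1),
  (1, 2),
  (1, 3), (2, 3),
  (1, 4), (2, 4), (3, 4), (4, 4);
""")

rows = conn.execute("""
WITH cte AS (
  SELECT film.film_id, film.title, COUNT(film_actor.actor_id) AS actor_count
  FROM film_actor
  INNER JOIN film ON film_actor.film_id = film.film_id
  GROUP BY film.film_id, film.title
)
SELECT title, actor_count
FROM cte
WHERE actor_count > (SELECT AVG(actor_count) FROM cte)
ORDER BY title
LIMIT 10
""").fetchall()

print(rows)  # [('A', 3), ('D', 4)]
```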
I have 5 SQL tables
store
staff
departments
sold_items
staff_rating
I created a view that JOINs the first four tables together. From the last table (staff_rating), I want to get the rating column whose date is closest to when the item was sold (sold_items.date) for each row of the view.
I have tried the following SQL queries, which work but have performance issues.
SQL QUERY 1
SELECT s.name,
s.country,
d.name,
si.item,
si.date,
(SELECT rating
FROM staff_ratings
WHERE staff_id = s.id
ORDER BY DATEDIFF(date, si.date) LIMIT 1) AS rating,
st.name,
st.owner
FROM store st
LEFT OUTER JOIN staff s ON s.store_id = st.id
LEFT JOIN departments d ON d.store_id = st.id
LEFT JOIN sold_items si ON si.store_id = st.id
SQL QUERY 2
SELECT s.name,
s.country,
d.name,
si.item,
si.date,
si.rating ,
st.name,
st.owner
FROM store st
LEFT OUTER JOIN staff s ON s.store_id = st.id
LEFT JOIN departments d ON d.store_id = st.id
LEFT JOIN (SELECT *,
(SELECT rating
FROM staff_ratings
WHERE staff_id = si.staff_id
ORDER BY DATEDIFF(date, si.date) LIMIT 1) AS rating
FROM sold_items) si ON si.store_id = st.id
SQL Query 2 is faster than SQL Query 1, but both still have performance issues. I would appreciate help with a better-performing query. Thanks in advance.
Your query doesn't look right to me (as mentioned in a comment on the original post: it lacks staff_id in the join on the sales, etc.).
Ignoring that, one of your biggest performance hits is likely to be this...
ORDER BY DATEDIFF(date, si.date) LIMIT 1
That order by can only be answered by comparing EVERY record for that staff member to the current sales record.
What you ideally want to be able to do is find the appropriate staff rating from an index, and not to have to run computations that involve dates from both the ratings table and the sales table.
If, for example, you wanted "the most recent rating BEFORE the sale", the query can be substantially improved...
SELECT
s.name,
s.country,
d.name,
si.item,
si.date,
(
SELECT sr.rating
FROM staff_ratings sr
WHERE sr.staff_id = s.id
AND sr.date <= si.date
ORDER BY sr.date DESC
LIMIT 1
)
AS rating,
st.name,
st.owner
FROM store st
LEFT JOIN staff s ON s.store_id = st.id
LEFT JOIN departments d ON d.store_id = st.id
LEFT JOIN sold_items si ON si.store_id = st.id
Then, with an index for staff_ratings(staff_id, date, rating) the optimiser can very quickly look up which rating to use, without having to scan Every Single Rating for that staff member.
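A minimal sketch of the "most recent rating on or before the sale" lookup, run against SQLite through Python's sqlite3 module rather than MySQL. The schema is an assumption: I gave sold_items a staff_id column (the original post does not show one), stored dates as ISO strings, and invented the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed, simplified schema; dates are ISO strings so they sort correctly.
CREATE TABLE staff_ratings (staff_id INTEGER, date TEXT, rating INTEGER);
CREATE TABLE sold_items (item TEXT, staff_id INTEGER, date TEXT);
-- Covering index: equality on staff_id, range on date, rating read from the index.
CREATE INDEX idx_ratings ON staff_ratings (staff_id, date, rating);
INSERT INTO staff_ratings VALUES
  (1, '2020-01-01', 3), (1, '2020-02-01', 4), (1, '2020-03-01', 5);
INSERT INTO sold_items VALUES ('pen', 1, '2019-12-01'), ('book', 1, '2020-02-15');
""")

rows = conn.execute("""
SELECT si.item,
       si.date,
       (SELECT sr.rating
          FROM staff_ratings sr
         WHERE sr.staff_id = si.staff_id
           AND sr.date <= si.date
         ORDER BY sr.date DESC
         LIMIT 1) AS rating
FROM sold_items si
ORDER BY si.date
""").fetchall()

print(rows)  # [('pen', '2019-12-01', None), ('book', '2020-02-15', 4)]
```

The 'pen' sale predates every rating, so the subquery returns NULL for it. If "closest in either direction" is really required, one option is to run a second lookup with >= and ASC and pick the nearer of the two.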
Why DATEDIFF? Would something like this work better? If so, the given index will make it work much faster.
WHERE staff_id = s.id
AND date >= si.date
ORDER BY date
LIMIT 1
And INDEX(staff_id, date)
Do you need LEFT JOIN? Perhaps plain JOIN?
d may benefit from INDEX(store_id, name)
I am trying to run a simple query in Spark SQL, but it throws an error unless I use first().
This query works normally with MySQL
SELECT film.title,count(rental.rental_id) as total_rentals, film.rental_rate, count(rental.rental_id) * film.rental_rate as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
But the same query doesn't work with Spark SQL.
The error I am getting is :
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Doing this actually fixes the problem
SELECT film.title,count(rental.rental_id) as total_rentals, first(film.rental_rate), count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
Can someone explain why this is required in Spark SQL?
There is a common requirement in SQL that all non-aggregated columns in a GROUP BY query must appear in the GROUP BY clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the GROUP BY clause.
I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running this in MySQL, which (alas!) has options that allow disabling the standard requirement.
In a database that supports functional dependency in GROUP BY, you would phrase the query as:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1
I don't think Spark would understand that, so just add all the needed columns to the group by clause:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
Notes:
having film_id in the group by clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together
count(r.rental_id) can be simplified as count(*) (since obviously that column cannot be null)
table aliases make the queries easier to write and read
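The first note can be checked on a toy example. This is a sketch using Python's sqlite3 module with invented data: two different films that happen to share a title and a rate. Grouping by title and rate merges them into one row, while adding film_id keeps them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
-- Two distinct films with the same title and rate (made-up data).
INSERT INTO film VALUES (1, 'TWIN', 2.0), (2, 'TWIN', 2.0);
INSERT INTO inventory VALUES (10, 1), (20, 2);
INSERT INTO rental VALUES (100, 10), (101, 10), (102, 20);
""")

SELECT_LIST = "SELECT f.title, COUNT(*), f.rental_rate, COUNT(*) * f.rental_rate"
JOINS = """
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
"""

# Grouping without film_id: both films collapse into a single row.
merged = conn.execute(
    SELECT_LIST + JOINS + "GROUP BY f.title, f.rental_rate"
).fetchall()

# Grouping with film_id: one row per actual film.
separate = conn.execute(
    SELECT_LIST + JOINS + "GROUP BY f.film_id, f.title, f.rental_rate ORDER BY f.film_id"
).fetchall()

print(merged)    # [('TWIN', 3, 2.0, 6.0)]
print(separate)  # [('TWIN', 2, 2.0, 4.0), ('TWIN', 1, 2.0, 2.0)]
```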
I suspect that you want:
SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,
SUM(f.rental_rate) as revenue
FROM rental r JOIN
inventory i
ON r.inventory_id = i.inventory_id JOIN
film f
ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;
Notes:
In general, the GROUP BY columns should be the unaggregated columns in the SELECT. MySQL has even required this with the default settings since ONLY_FULL_GROUP_BY became part of the default sql_mode in MySQL 5.7.
You can just sum the rental_rate column. There is no need to count and multiply.
Table aliases make the query easier to write and to read.
That the first SQL works in MySQL is because MySQL extends the SQL syntax to allow it. SparkSQL (in this case) is doing what just about every other database does.
I just finished taking the MySQL course as a beginner and struggled a bit with the join and sub-queries statements.
For the question, why is my answer incorrect?:
MY RESPONSE:
SELECT f.name, r.name, COUNT(s.room_id) AS film_times FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.name = 'Chaplin';
SOLUTION:
SELECT f.name, r.name, COUNT(r.name) AS film FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
GROUP BY f.name
ORDER BY film DESC
LIMIT 1;
The most notable problem with your query is that it is missing a GROUP BY clause, while it mixes an aggregate function (COUNT()) with non-aggregated columns in the SELECT clause. In standard SQL this is invalid. Basically, if you want to count rows per film, you need to specify a grouping criterion.
Also, you are missing the ORDER BY and LIMIT 1, which let you select the film with the most occurrences (that is, the group that contains the most rows).
The SELECT and FROM clauses look fine - the solution filters rooms by id while you filter by name, but both should be OK (as long as there are no duplicate room names).
Finally, let me point out that the solution is somewhat flawed: not all non-aggregated columns appear in the GROUP BY clause, which is a common SQL requirement (although old versions of MySQL are lax about it). Furthermore, it groups by film name, which (again) opens up the possibility of issues if two different films have the same name. This would be better phrased as:
SELECT f.name, r.name, COUNT(*) AS no_occurences FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
GROUP BY f.id, f.name, r.name
ORDER BY no_occurences DESC
LIMIT 1;
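Here is a small sketch of the GROUP BY / ORDER BY / LIMIT 1 pattern, run with Python's sqlite3 module on invented data. The table and column names follow the queries above; the room name 'Kubrick' and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE films (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rooms (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE screenings (id INTEGER PRIMARY KEY, film_id INTEGER, room_id INTEGER);
INSERT INTO films VALUES (1, 'X'), (2, 'Y');
INSERT INTO rooms VALUES (1, 'Chaplin'), (2, 'Kubrick');
-- Room 1: X twice, Y once. Room 2: Y three times.
INSERT INTO screenings (film_id, room_id) VALUES
  (1, 1), (1, 1), (2, 1), (2, 2), (2, 2), (2, 2);
""")

# Without GROUP BY, COUNT(*) covers every screening in the room at once.
total_in_room = conn.execute("""
SELECT COUNT(*)
FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
""").fetchone()[0]

# With GROUP BY, there is one count per film; ORDER BY + LIMIT 1 keeps the largest.
top = conn.execute("""
SELECT f.name, r.name, COUNT(*) AS no_occurrences
FROM films f
JOIN screenings s ON f.id = s.film_id
JOIN rooms r ON s.room_id = r.id
WHERE r.id = 1
GROUP BY f.id, f.name, r.name
ORDER BY no_occurrences DESC
LIMIT 1
""").fetchone()

print(total_in_room)  # 3
print(top)            # ('X', 'Chaplin', 2)
```

If two films tie for first place, LIMIT 1 returns only one of them, and which one is not guaranteed unless a tie-breaker is added to the ORDER BY.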
I'm having trouble creating some queries.
I'm using Sakila DB. I am trying to create a new column with the number of delays per client, using "count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as n"...
Which are the top 10 customers with the most delays in returning movies.
Select customer.first_name, customer.last_name, count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as nTime
From customer,film,rental,inventory
Where customer.customer_id=rental.customer_id
and rental.inventory_id=inventory.inventory_id
and (datediff (rental.rental_date,rental.return_date)) > film.rental_duration
limit 10;
What am I doing wrong?
Thanks!!
I made a few assumptions, but this should be quite close to what you want:
select c.customer_id, count(*)
from
customer c
inner join rental r
on r.customer_id = c.customer_id
inner join film f
on f.film_id = r.film_id
and datediff(r.return_date, r.rental_date) > f.rental_duration
group by c.customer_id
order by count(*) desc
limit 10;
Problems with your query:
it is missing aggregation; you need to group records by customers, so you can compute how many late rental returns happened per customer
it is missing a join condition for the film table; I assumed that film relates to rental through column film_id
the previous issue would have been much easier to spot if you were using standard, explicit joins instead of old-school, implicit joins; this is one of the many reasons why you should always use standard joins
as commented by Thorsten Kettner, the inventory table seems superfluous in this query: the 3 other tables contain all the information you need
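One caveat on the last two points: as far as I can tell, in the stock Sakila schema rental has no film_id column and reaches film through inventory (rental.inventory_id, then inventory.film_id), which is also how the first thread above joins them. Under that assumption, inventory is needed after all. Below is a sketch of that variant on a stripped-down, made-up dataset, run with Python's sqlite3 module. SQLite has no DATEDIFF, so the day difference is computed with julianday() here; in MySQL you would keep datediff(r.return_date, r.rental_date).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stripped-down stand-in for the Sakila tables involved (made-up rows).
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE film (film_id INTEGER PRIMARY KEY, rental_duration INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, rental_date TEXT,
                     inventory_id INTEGER, customer_id INTEGER, return_date TEXT);
INSERT INTO customer VALUES (1, 'ANN', 'A'), (2, 'BOB', 'B');
INSERT INTO film VALUES (1, 3);  -- may be kept for 3 days
INSERT INTO inventory VALUES (10, 1);
INSERT INTO rental VALUES
  (100, '2020-01-01', 10, 1, '2020-01-10'),  -- 9 days: late
  (101, '2020-02-01', 10, 1, '2020-02-02'),  -- 1 day: on time
  (102, '2020-03-01', 10, 2, '2020-03-08'),  -- 7 days: late
  (103, '2020-04-01', 10, 2, '2020-04-09'),  -- 8 days: late
  (104, '2020-05-01', 10, 2, NULL);          -- not returned yet: ignored
""")

rows = conn.execute("""
SELECT c.first_name, c.last_name, COUNT(*) AS n_late
FROM customer c
INNER JOIN rental r ON r.customer_id = c.customer_id
INNER JOIN inventory i ON i.inventory_id = r.inventory_id
INNER JOIN film f ON f.film_id = i.film_id
WHERE julianday(r.return_date) - julianday(r.rental_date) > f.rental_duration
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY n_late DESC
LIMIT 10
""").fetchall()

print(rows)  # [('BOB', 'B', 2), ('ANN', 'A', 1)]
```

Note the argument order: the later date goes first, otherwise the difference is negative and no rental ever counts as late.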