I'm having trouble creating some queries.
I'm using Sakila DB. I am trying to create a new column with the number of delays per client, using "count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as n"...
Which are the top 10 customers with the most delays in returning movies?
Select customer.first_name, customer.last_name, count ((datediff (rental.rental_date, rental.return_date))> film.rental_duration as nTime
From customer,film,rental,inventory
Where customer.customer_id=rental.customer_id
and rental.inventory_id=inventory.inventory_id
and (datediff (rental.rental_date,rental.return_date)) > film.rental_duration
limit 10;
What am I doing wrong?
Thanks!!
I made a few assumptions, but this should be quite close to what you want:
select c.customer_id, count(*)
from
customer c
inner join rental r
on r.customer_id = c.customer_id
inner join film f
on f.film_id = r.film_id
and datediff(r.return_date, r.rental_date) > f.rental_duration
group by c.customer_id
order by count(*) desc
limit 10;
Problems with your query:
it is missing aggregation; you need to group records by customers, so you can compute how many late rental returns happened per customer
it is missing a join condition for the film table; I assumed that film relates to rental through column film_id
the previous issue would have been much easier to spot if you were using standard, explicit joins instead of old-school, implicit joins; this is one of the many reasons why you should always use standard joins
as commented by Thorsten Kettner, the inventory table seems superfluous in this query: the 3 other tables contain all the information you need
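For anyone who wants to try the grouping logic without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. It rests on a few assumptions: the tables are tiny hand-made stand-ins with invented rows; SQLite has no DATEDIFF, so a julianday() difference plays that role (in MySQL, DATEDIFF(a, b) returns a minus b, so the return date goes first); and it routes rental to film through inventory, because as far as I recall the stock Sakila rental table carries inventory_id rather than film_id. If your schema really has rental.film_id, the inventory join can be dropped.

```python
import sqlite3

# Tiny in-memory stand-in for the Sakila tables (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE film      (film_id INTEGER PRIMARY KEY, rental_duration INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental    (rental_id INTEGER PRIMARY KEY, rental_date TEXT,
                        return_date TEXT, inventory_id INTEGER, customer_id INTEGER);

INSERT INTO customer  VALUES (1, 'MARY', 'SMITH'), (2, 'LINDA', 'JONES');
INSERT INTO film      VALUES (10, 3), (11, 5);
INSERT INTO inventory VALUES (100, 10), (101, 11);
INSERT INTO rental VALUES
  (1, '2005-05-01', '2005-05-08', 100, 1),  -- kept 7 days, allowed 3: late
  (2, '2005-05-10', '2005-05-12', 100, 1),  -- kept 2 days: on time
  (3, '2005-05-01', '2005-05-09', 101, 1),  -- kept 8 days, allowed 5: late
  (4, '2005-05-01', '2005-05-07', 101, 2),  -- kept 6 days, allowed 5: late
  (5, '2005-05-20', NULL,         100, 2);  -- not returned yet: filtered out
""")

# One output row per customer, counting only rentals returned late.
late = conn.execute("""
SELECT c.first_name, c.last_name, COUNT(*) AS late_returns
FROM customer c
JOIN rental r    ON r.customer_id = c.customer_id
JOIN inventory i ON i.inventory_id = r.inventory_id
JOIN film f      ON f.film_id = i.film_id
WHERE julianday(r.return_date) - julianday(r.rental_date) > f.rental_duration
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY late_returns DESC
LIMIT 10
""").fetchall()

for row in late:
    print(row)
```

Rentals with a NULL return_date drop out because the comparison evaluates to NULL; whether unreturned rentals should count as late is a separate decision.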
I'm trying to join 2 tables and count the number of entries for unique variables in one of the columns. In this case I'm trying to join 2 tables - patients and trials (patients has a FK to trials) and count the number of patients that show up in each trial. This is the code I have so far:
SELECT patients.trial_id, trials.title
FROM trials
JOIN(SELECT patients, COUNT(id) AS Num_Enrolled
FROM patients
GROUP BY trials) AS Trial_Name;
The outcome I'm trying to achieve is:
Trial_Name Num_Patients
Bushtucker 5
Tribulations 7
I'm completely new to SQL and have been struggling with the syntax compared to scripting languages.
It's not 100% clear from your question what your columns are named, but you are after a basic aggregation. Adjust the names of the columns if necessary:
select t.title Trial_Name, Count(*) Num_Patients
from Trials t
join Patients p on p.Trial_Id = t.Id
group by t.title;
Building on Stu-'s answer: your column naming looks wrong, but you can write the query with logic like this.
SELECT t.title AS Trial_Name, COUNT(p.id) AS Num_Patients
FROM trials AS t
INNER JOIN patients AS p
ON p.trial_id = t.id
GROUP BY t.title;
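One detail worth checking with a quick experiment: an inner join silently drops trials that have no patients at all. The sketch below, run against SQLite through Python's sqlite3 module, uses a LEFT JOIN and COUNT(p.id) so such trials show up with a count of 0. The column names (trials.id, patients.trial_id) and the third 'Empty trial' row are assumptions made for illustration, not taken from your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trials   (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE patients (id INTEGER PRIMARY KEY,
                       trial_id INTEGER REFERENCES trials(id));

INSERT INTO trials VALUES (1, 'Bushtucker'), (2, 'Tribulations'), (3, 'Empty trial');
-- 5 patients in trial 1, 7 patients in trial 2, none in trial 3
INSERT INTO patients (trial_id) VALUES (1),(1),(1),(1),(1),
                                       (2),(2),(2),(2),(2),(2),(2);
""")

# COUNT(p.id) ignores the NULLs produced by the LEFT JOIN, so an
# unmatched trial is reported as 0 rather than 1.
rows = conn.execute("""
SELECT t.title AS Trial_Name, COUNT(p.id) AS Num_Patients
FROM trials t
LEFT JOIN patients p ON p.trial_id = t.id
GROUP BY t.id, t.title
ORDER BY t.id
""").fetchall()

for row in rows:
    print(row)
```

Grouping by t.id as well as t.title keeps two trials that happen to share a title from being merged.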
I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Subquery results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ reviews ] GROUP BY fruit_id is invalid. If you group by fruit_id, which count_vote and which r.text_c do you expect to get for the ID? You don't tell the DBMS (that would be something like MAX(count_vote) and MIN(r.text_c), for instance). MySQL should throw an error (and does when ONLY_FULL_GROUP_BY is enabled), but otherwise it silently replaces count_vote, r.text_c by ANY_VALUE(count_vote), ANY_VALUE(r.text_c) instead. This means you get arbitrarily picked values for a fruit.
Hence the answer to your question is: don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
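If the intent behind the ORDER BY plus GROUP BY was "one review per fruit, the one with the most votes, lowest r_id on ties" (that is only my guess at the intent), it can be stated deterministically. A minimal sketch with Python's sqlite3 module and invented sample rows; the NOT EXISTS form is used because it works on old and new engines alike:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Reviews (r_id INTEGER PRIMARY KEY, fruit_id INTEGER,
                      customer_id INTEGER, count_vote INTEGER, text_c TEXT);
INSERT INTO Reviews VALUES
  (1, 1, 7, 3, 'ok'),
  (2, 1, 8, 9, 'great'),
  (3, 1, 9, 9, 'also great'),  -- ties with r_id 2 on votes, loses on r_id
  (4, 2, 7, 1, 'meh');
""")

# Keep a review only if no other review of the same fruit beats it:
# more votes wins, and on equal votes the smaller r_id wins.
top_reviews = conn.execute("""
SELECT r.fruit_id, r.r_id, r.count_vote, r.text_c
FROM Reviews r
WHERE NOT EXISTS (
    SELECT 1
    FROM Reviews better
    WHERE better.fruit_id = r.fruit_id
      AND (better.count_vote > r.count_vote
           OR (better.count_vote = r.count_vote AND better.r_id < r.r_id))
)
ORDER BY r.fruit_id
""").fetchall()

for row in top_reviews:
    print(row)
```

An index on (fruit_id, count_vote, r_id) would be the natural thing to try for this shape of query, but measure it against your own data first.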
Your Categories table does not seem to be joined/related to the others; this produces a Cartesian product between all the rows.
If you want distinct results, don't use group by but distinct, so you can avoid an unnecessary subquery,
and you don't need an order by in a subquery.
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /* Your Categories table seems not joined/related to the others */
WHERE category = "Fruits";
For better readability, you should use explicit join syntax and avoid the old join syntax based on comma-separated table names and where conditions.
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate cartesian join - it's also unclear whether any attributes from this table are actually required for the result, whether it is there to produce multiples of the same row in the output, or whether it is just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As is all too often the case, the biggest benefit may come from simply asking why you need to retrieve every single row in the tables in a single query.
I am trying to run a simple query inside Spark SQL, but it throws an error unless I use first().
This query works normally with MySQL
SELECT film.title,count(rental.rental_id) as total_rentals, film.rental_rate, count(rental.rental_id) * film.rental_rate as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
But the same query doesn't work with Spark SQL.
The error I am getting is :
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'film.`rental_rate`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Doing this actually fixes the problem
SELECT film.title,count(rental.rental_id) as total_rentals, first(film.rental_rate), count(rental.rental_id) * first(film.rental_rate) as revenue
FROM rental
INNER JOIN inventory ON rental.inventory_id = inventory.inventory_id
INNER JOIN film ON inventory.film_id = film.film_id
GROUP BY film.title
ORDER BY 1
Can someone explain why this is required in terms of Spark SQL?
There is a common requirement in SQL that all non-aggregated columns in a group by query must appear in the group by clause. Some databases understand the concept of functionally dependent columns and let you get away with putting only the primary key column in the group by clause.
I guess that title is not the primary key of film, so your original query is not valid standard SQL. I suspect that you are running this in MySQL, which (alas!) has options that allow disabling the standard requirement.
In a database that supports functional dependency in group by, you would phrase the query as:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id
ORDER BY 1
I don't think Spark would understand that, so just add all the needed columns to the group by clause:
SELECT f.title, count(*) as total_rentals, f.rental_rate, count(*) * f.rental_rate as revenue
FROM rental r
INNER JOIN inventory i ON r.inventory_id = i.inventory_id
INNER JOIN film f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY 1
Notes:
having film_id in the group by clause is still good practice; in real life, two different movies might have the same title and rate, and you don't want to group them together
count(r.rental_id) can be simplified as count(*) (since obviously that column cannot be null)
table aliases make the queries easier to write and read
I suspect that you want:
SELECT f.title, COUNT(*) as total_rentals, f.rental_rate,
SUM(f.rental_rate) as revenue
FROM rental r JOIN
inventory i
ON r.inventory_id = i.inventory_id JOIN
film f
ON i.film_id = f.film_id
GROUP BY f.title, f.rental_rate
ORDER BY 1;
Notes:
In general, the GROUP BY columns should be the unaggregated columns in the SELECT. That is even required (with the default settings) in MySQL for the past few years.
You can just sum the rental_rate column. There is no need to count and multiply.
Table aliases make the query easier to write and to read.
That the first SQL works in MySQL is because MySQL extends the SQL syntax to allow it. SparkSQL (in this case) is doing what just about every other database does.
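To see that the fully grouped form gives the same revenue whether you multiply the count by the rate or sum the rate directly, here is a small sketch using Python's sqlite3 module. SQLite is only a convenient stand-in here and does not reproduce Spark's error; the film titles and rates are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film      (film_id INTEGER PRIMARY KEY, title TEXT, rental_rate REAL);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental    (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);

INSERT INTO film      VALUES (1, 'ALIEN CENTER', 2.5), (2, 'ZORRO ARK', 4.0);
INSERT INTO inventory VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO rental    VALUES (1, 10), (2, 10), (3, 11), (4, 20);
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY,
# which is the form that strict engines accept.
revenue = conn.execute("""
SELECT f.title,
       COUNT(*)                 AS total_rentals,
       f.rental_rate,
       COUNT(*) * f.rental_rate AS revenue_by_count,
       SUM(f.rental_rate)       AS revenue_by_sum
FROM rental r
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film f      ON i.film_id = f.film_id
GROUP BY f.film_id, f.title, f.rental_rate
ORDER BY f.title
""").fetchall()

for row in revenue:
    print(row)
```

The two revenue columns agree because rental_rate is constant within each group; with a rate that varied per rental, only the SUM form would still be meaningful.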
The query below is grabbing some information about a category of toys and showing the most recent sale price for three levels of condition (e.g., Brand New, Used, Refurbished). The price for each sale is almost always different. One other thing: the sales table row ids are not necessarily in chronological order (e.g., a toy with a sale id of 5 could have been sold later than a toy with a sale id of 10).
This query works but is not performant. It runs in a manageable amount of time, usually about 1s. However, I need to add yet another left join to include some more data, which causes the query time to balloon up to about 9s, no bueno.
Here is the working but nonperformant query:
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN (
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
) AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
But like I said it's slow. The sales table has about 200k rows.
What I tried to do was create the subquery as a view, e.g.,
CREATE VIEW sales_view AS
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
Then replace the subquery with the view, like
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN sales_view AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
Unfortunately, this change causes the query to no longer grab the most recent sale, and the sales price it returns is no longer the most recent.
Why is it that the table view doesn't return the same result as the same select as a subquery?
After reading just about every top-n-per-group stackoverflow question and blog article I could find, getting a query that actually worked was fantastic. But now that I need to extend the query one more step I'm running into performance issues. If anybody wants to sidestep the above question and offer some ways to optimize the original query, I'm all ears!
Thanks for any and all help.
The solution to the subquery performance issue was to use the answer provided here: Groupwise maximum
I thought that this approach could only be used when querying a single table, but indeed it works even when you've joined many other tables. You just have to left join the same table twice using the s.date_sold < s2.date_sold join condition and make sure the where clause looks for the null value in the second table's id column.
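For reference, the left self-join pattern described above can be sketched as follows, using Python's sqlite3 module and a cut-down sales table. The sale_id column name and all sample rows are assumptions for illustration; the other joined tables are left out to keep the pattern visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, catalog_product_id INTEGER,
                    condition_id INTEGER, date_sold TEXT, sold_price REAL,
                    invalid INTEGER);
INSERT INTO sales VALUES
  (5,  1, 1, '2020-03-01', 30.0, 0),  -- newest for product 1 / condition 1, despite the lower id
  (10, 1, 1, '2020-01-15', 25.0, 0),
  (11, 1, 2, '2020-02-01', 12.0, 0),
  (12, 1, 2, '2020-04-01', 99.0, 1),  -- invalid sale: must not count as newer
  (13, 2, 1, '2020-02-20', 40.0, 0);
""")

# s2 looks for a strictly newer valid sale of the same product and condition.
# When none exists, the LEFT JOIN leaves s2 as NULL, so s is the latest sale.
latest = conn.execute("""
SELECT s.catalog_product_id, s.condition_id, s.date_sold, s.sold_price
FROM sales s
LEFT JOIN sales s2
       ON s2.catalog_product_id = s.catalog_product_id
      AND s2.condition_id = s.condition_id
      AND s2.invalid = 0
      AND s.date_sold < s2.date_sold
WHERE s.invalid = 0
  AND s.condition_id <= 3
  AND s2.sale_id IS NULL
ORDER BY s.catalog_product_id, s.condition_id
""").fetchall()

for row in latest:
    print(row)
```

The filters on s2 have to live in the ON clause; moving them into WHERE would turn the left join back into an inner join. If two sales share the same date_sold, both survive, so a tie-breaker on the id may be needed.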
My database has 3 tables. One is called Customer, one is called Orders, and one is called RMA. The RMA table has the info regarding returns. I'll include a screen shot of all 3 so you can see the appropriate attributes. This is the code of the query I'm working on:
SELECT State, SKU, count(*)
from Orders INNER JOIN Customer ON Orders.Customer_ID = Customer.CustomerID
INNER JOIN RMA ON Orders.Order_ID = RMA.Reason
Group by SKU
Order by SKU
LIMIT 10;
I'm trying to get how much of each product(SKU) is returned in each state(State). Any help would really be appreciated. I'm not sure why, but anytime I include a JOIN statement, my query takes anywhere from 5 minutes to 20 minutes to process.
[Screenshots: Customer table, RMA table]
Your query should look like this:
SELECT c.State, o.SKU, COUNT(*)
FROM Orders o INNER JOIN
Customer c
ON o.Customer_ID = c.CustomerID JOIN
RMA
ON o.Order_ID = RMA.Order_Id
GROUP BY c.State, o.SKU
ORDER BY SKU;
Your issue is probably the incorrect JOIN condition between Orders and RMA.
If you have primary keys properly declared on the tables, then this query should have good-enough performance.
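As a rough check of the corrected join, here is a minimal sketch with Python's sqlite3 module. The RMA.Order_ID column, the RMA_ID key, the index name, and all rows are assumptions, since the screenshots are not available to me; adjust them to your actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, State TEXT);
CREATE TABLE Orders   (Order_ID INTEGER PRIMARY KEY, Customer_ID INTEGER, SKU TEXT);
CREATE TABLE RMA      (RMA_ID INTEGER PRIMARY KEY, Order_ID INTEGER, Reason TEXT);
-- Hypothetical index on the join column, so the lookup need not scan RMA
CREATE INDEX idx_rma_order ON RMA (Order_ID);

INSERT INTO Customer VALUES (1, 'TX'), (2, 'CA');
INSERT INTO Orders   VALUES (100, 1, 'SKU-A'), (101, 1, 'SKU-B'),
                            (102, 2, 'SKU-A'), (103, 2, 'SKU-A');
INSERT INTO RMA      VALUES (1, 100, 'Damaged'), (2, 102, 'Wrong item'),
                            (3, 103, 'Damaged');
""")

# Join RMA to Orders on the order key (not on Reason), then count the
# returned orders per state and SKU.
returns = conn.execute("""
SELECT c.State, o.SKU, COUNT(*) AS num_returns
FROM Orders o
JOIN Customer c ON o.Customer_ID = c.CustomerID
JOIN RMA r      ON o.Order_ID = r.Order_ID
GROUP BY c.State, o.SKU
ORDER BY o.SKU, c.State
""").fetchall()

for row in returns:
    print(row)
```

Order 101 was never returned, so SKU-B does not appear in the output at all.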
Given you are joining with an Orders table I'm going to assume this table contains all the orders that the company has ever done. This can be quite large and would likely cause the slowness you are seeing.
You can likely improve this query by placing some constraint on the Orders you select; restricting the date range is a common way to do this. If you provide more information about what the query is for and how large the dataset is, everyone will be able to give better guidance on which filters would work best.