Combining three tables when two have the same foreign key - mysql

Hello, I am making some accounting software that lets you store transactions across many months (in the transactions table) and categorize those transactions from a list of available categories (in the categories table). Each month the user can also create a budget: they pick a subset of the categories and assign a goal to each entry. These lists are stored in the budgets table.
I need a query that returns a spending and budget summary for a specified month. This summary would be a table of categories.name, budgets.goal, and SUM(transactions.amount).
Sometimes a user would have a budget item for the specified month but hasn't made any transactions with that category yet (Out to Eat example). Sometimes a user would have an unexpected expense they didn't budget for (Auto Repair Example) and there will be some categories (like vacation) where the user didn't budget that item and there were no expenses of that category.
SELECT categories.id, categories.name, SUM(transactions.amount)
FROM categories
LEFT JOIN transactions ON categories.id=transactions.category_id
WHERE transactions.date LIKE '2019-08-%' GROUP BY categories.id;
gets me half of what I want and
SELECT categories.id, categories.name, budgets.goal
FROM categories
LEFT JOIN budgets ON categories.id=budgets.category_id
WHERE budgets.date LIKE '2019-08-%' GROUP BY categories.id;
gets me the other half of what I want. Is there a single query that can return results like what I have pictured above? I would be even more thrilled if we could exclude results where both goal and sum are NULL.

If you know that there can be at most one budget per category and month, then you can (LEFT) JOIN both (child) tables to categories:
SELECT
c.id,
c.name,
MAX(b.goal) as budget_goal,
SUM(t.amount) as total_txn_amount
FROM categories c
LEFT JOIN budgets b
ON c.id = b.category_id
AND b.date LIKE '2019-08-%'
LEFT JOIN transactions t
ON c.id = t.category_id
AND t.date LIKE '2019-08-%'
GROUP BY c.id
HAVING COALESCE(budget_goal, total_txn_amount) IS NOT NULL;
Note that while we know that there can be only one budget per group, the engine doesn't, and might claim that b.goal must be either in the GROUP BY clause or used in an aggregate function. So we use MAX(b.goal) to avoid that error.
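To see the double LEFT JOIN at work, here is a minimal sketch that runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and the sample rows (including the Groceries category) are made up for illustration. The month filter is written as a date range because SQLite stores these dates as text.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; all rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE budgets (category_id INTEGER, date TEXT, goal REAL);
CREATE TABLE transactions (category_id INTEGER, date TEXT, amount REAL);
INSERT INTO categories VALUES
  (1, 'Groceries'), (2, 'Out to Eat'), (3, 'Auto Repair'), (4, 'Vacation');
INSERT INTO budgets VALUES
  (1, '2019-08-01', 300), (2, '2019-08-01', 100), (1, '2019-07-01', 250);
INSERT INTO transactions VALUES
  (1, '2019-08-03', 40), (1, '2019-08-17', 60),
  (3, '2019-08-09', 500), (4, '2019-07-20', 900);
""")

# Both month filters live in the ON clauses, so categories without a match
# are kept with NULLs instead of being filtered out.
summary = conn.execute("""
SELECT c.name, MAX(b.goal), SUM(t.amount)
FROM categories c
LEFT JOIN budgets b
  ON b.category_id = c.id
 AND b.date >= '2019-08-01' AND b.date < '2019-09-01'
LEFT JOIN transactions t
  ON t.category_id = c.id
 AND t.date >= '2019-08-01' AND t.date < '2019-09-01'
GROUP BY c.id
HAVING COALESCE(MAX(b.goal), SUM(t.amount)) IS NOT NULL
ORDER BY c.id
""").fetchall()

for row in summary:
    print(row)
```

With this sample data, Out to Eat comes back with a goal but no sum, Auto Repair with a sum but no goal, and Vacation is dropped by the HAVING clause because both values are NULL.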
To improve the performance of the first JOIN I would change
AND b.date LIKE '2019-08-%'
to
AND b.date >= '2019-08-01'
AND b.date < '2019-08-01' + INTERVAL 1 MONTH
and create a composite index on (category_id, date).
Also, to enforce the uniqueness of the category and month combination in the budgets table, I would create a virtual column like
`year_month` VARCHAR(7) as (LEFT(date, 7))
and a UNIQUE KEY on (category_id, `year_month`). The backticks are needed because YEAR_MONTH is a reserved word in MySQL.
Then you can use
LEFT JOIN budgets b
ON c.id = b.category_id
AND b.`year_month` = '2019-08'

You can try the query below to get the result:
SELECT c.name, b.goal, SUM(ct.amount)
FROM era.categories c
LEFT JOIN budget b ON b.cat_id = c.id
LEFT JOIN transactions ct ON ct.cat_id = c.id
WHERE b.date LIKE '2019-08-%'
GROUP BY ct.cat_id;
Thanks.

Related

MySQL View in place of subquery does not return the same result

The query below is grabbing some information about a category of toys and showing the most recent sale price for three levels of condition (e.g., Brand New, Used, Refurbished). The price for each sale is almost always different. One other thing: the sales table row ids are not necessarily in chronological order (e.g., a sale with an id of 5 could have happened later than a sale with an id of 10).
This query works but is not performant. It runs in a manageable amount of time, usually about 1s. However, I need to add yet another left join to include some more data, which causes the query time to balloon up to about 9s, no bueno.
Here is the working but nonperformant query:
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN (
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
) AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
But like I said it's slow. The sales table has about 200k rows.
What I tried to do was create the subquery as a view, e.g.,
CREATE VIEW sales_view AS
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
Then replace the subquery with the view, like
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN sales_view AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
Unfortunately, this change causes the query to no longer grab the most recent sale, and the sales price it returns is no longer the most recent.
Why is it that the table view doesn't return the same result as the same select as a subquery?
After reading just about every top-n-per-group stackoverflow question and blog article I could find, getting a query that actually worked was fantastic. But now that I need to extend the query one more step I'm running into performance issues. If anybody wants to sidestep the above question and offer some ways to optimize the original query, I'm all ears!
Thanks for any and all help.
The solution to the subquery performance issue was to use the answer provided here: Groupwise maximum
I thought that this approach could only be used when querying a single table, but indeed it works even when you've joined many other tables. You just have to left join the same table twice using the s.date_sold < s2.date_sold join condition and make sure the where clause looks for the null value in the second table's id column.
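A minimal sketch of that self-join pattern, run against an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL. The sale_id column name and the sample rows are assumptions for illustration, and the invalid flag and the other joined tables are left out to keep it short.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; sale_id and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY,
  catalog_product_id INTEGER,
  condition_id INTEGER,
  date_sold TEXT,
  sold_price REAL
);
-- Sale 5 happened later than sale 10: ids are not chronological.
INSERT INTO sales VALUES
  (5,  1, 1, '2020-03-01', 12.0),
  (10, 1, 1, '2020-01-15',  9.0),
  (11, 1, 2, '2020-02-01',  6.0),
  (12, 1, 2, '2020-02-20',  7.5);
""")

# s2 looks for a later sale of the same product and condition.
# Rows where no such sale exists (s2.sale_id IS NULL) are the most recent ones.
latest = conn.execute("""
SELECT s.catalog_product_id, s.condition_id, s.date_sold, s.sold_price
FROM sales s
LEFT JOIN sales s2
  ON s2.catalog_product_id = s.catalog_product_id
 AND s2.condition_id = s.condition_id
 AND s.date_sold < s2.date_sold
WHERE s2.sale_id IS NULL
ORDER BY s.condition_id
""").fetchall()

for row in latest:
    print(row)
```

If two sales of the same product and condition share the same latest date_sold, this pattern returns both, so a tie-breaker on the id would be needed in that case.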

MySQL Compare Result in WHERE clause

I imagine I'm missing something pretty obvious here.
I'm trying to display a list of 'bookings' where the total charges is higher than the total payments for the booking. The charges and payments are stored in separate tables linked using foreign keys.
My query so far is:
SELECT `booking`.`id`,
SUM(`booking_charge`.`amount`) AS `charges`,
SUM(`booking_payment`.`amount`) AS `payments`
FROM `booking`
LEFT JOIN `booking_charge` ON `booking`.`id` = `booking_charge`.`booking_id`
LEFT JOIN `booking_payment` ON `booking`.`id` = `booking_payment`.`booking_id`
WHERE `charges` > `payments` -- this is the incorrect part
GROUP BY `booking`.`id`
My tables look something like this:
Booking (ID)
Booking_Charge (Booking_ID, Amount)
Booking_Payment (Booking_ID, Amount)
MySQL doesn't seem to like comparing the results from these two tables, I'm not sure what I'm missing but I'm sure it's something which would be possible.
Try HAVING instead of WHERE, like this:
SELECT `booking`.`id`,
SUM(`booking_charge`.`amount`) AS `charges`,
SUM(`booking_payment`.`amount`) AS `payments`
FROM `booking`
LEFT JOIN `booking_charge` ON `booking`.`id` = `booking_charge`.`booking_id`
LEFT JOIN `booking_payment` ON `booking`.`id` = `booking_payment`.`booking_id`
GROUP BY `booking`.`id`
HAVING `charges` > `payments`
One of the problems with the query is the cross join between rows from `_charge` and rows from `_payment`. It's a semi-Cartesian join. Each row returned from `_charge` will be matched with each row returned from `_payment`, for a given `booking_id`.
Consider a simple example:
Let's put a single row in `_charge` for $40 for a particular `booking_id`.
And put two rows into `_payment` for $20 each, for the same `booking_id`.
The query would return total charges of $80 (= 2 x $40). If there were instead five rows in `_payment` for $10 each, the query would return total charges of $200 (= 5 x $40).
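The arithmetic in that example can be reproduced with a small sketch. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, with one $40 charge and two $20 payments for a single booking.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL, loaded with the example rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE booking (id INTEGER PRIMARY KEY);
CREATE TABLE booking_charge (booking_id INTEGER, amount INTEGER);
CREATE TABLE booking_payment (booking_id INTEGER, amount INTEGER);
INSERT INTO booking VALUES (1);
INSERT INTO booking_charge VALUES (1, 40);
INSERT INTO booking_payment VALUES (1, 20), (1, 20);
""")

# Joining both child tables pairs the single charge with each payment,
# so the charge is summed twice.
naive = conn.execute("""
SELECT b.id, SUM(c.amount), SUM(p.amount)
FROM booking b
LEFT JOIN booking_charge c ON c.booking_id = b.id
LEFT JOIN booking_payment p ON p.booking_id = b.id
GROUP BY b.id
""").fetchone()

# The real total, taken from the charge table alone.
true_charges = conn.execute(
    "SELECT SUM(amount) FROM booking_charge WHERE booking_id = 1"
).fetchone()[0]

print(naive)         # charges are doubled in the joined result
print(true_charges)  # the actual charge total
```

The joined query reports 80 in charges against 40 in payments, so the booking looks unpaid even though it is settled.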
There's a couple of approaches to addressing that issue. One approach is to do the aggregation in an inline view, and return the total of the charges and payments as a single row for each booking_id, and then join those to the booking table. With at most one row per booking_id, the cross join doesn't give rise to the problem of "duplicating" rows from _charge and/or _payment.
For example:
SELECT b.id
, IFNULL(c.amt,0) AS charges
, IFNULL(p.amt,0) AS payments
FROM booking b
LEFT
JOIN ( SELECT bc.booking_id
, SUM(bc.amount) AS amt
FROM booking_charge bc
GROUP BY bc.booking_id
) c
ON c.booking_id = b.id
LEFT
JOIN ( SELECT bp.booking_id
, SUM(bp.amount) AS amt
FROM booking_payment bp
GROUP BY bp.booking_id
) p
ON p.booking_id = b.id
WHERE IFNULL(c.amt,0) > IFNULL(p.amt,0)
We could make use of a HAVING clause, in place of the WHERE.
The query in this answer is not the only way to get the result, nor is it the most efficient. There are other query patterns that will return an equivalent result.

I can't wrap my head around joins

So, alright, I have a few tables. My current query runs against a "historical" table. I want to do a join of some kind to get the most recent status from my Current table. These tables share a like column, called "ID"
Here's the structure
ddCurrent
-ID
-Location
-Status
-Time
ddHistorical
-CID (auto-increment field to keep multiple records per site)
-ID
-Location
-Status
-Time
My goal now is to do a simple join to get all the variables from ddHistorical and the current Status from ddCurrent.
I know that they can be joined on ID since both of them have the same items in their ID columns. I just can't figure out which kind of join is appropriate, or why.
I'm sure someone may provide a specific link that goes into great detail explaining, but I'll try to summarize it this way. When writing a query, I try to list the tables from the position of what table do I want to get data from and have that as my first table in the "FROM" clause. Then, do "JOIN" criteria to other tables based on relationships (such as IDs). In your example
FROM
ddHistorical ddH
INNER JOIN ddCurrent ddC
on ddH.ID = ddC.ID
In this case, with an INNER JOIN (the same as a plain JOIN), ddHistorical is the left table (listed first, for my own styling consistency and indentation) and ddCurrent is the right table. Notice that my ON criteria also read left alias.column = right alias.column; again, this is just for mental correlation purposes.
An INNER JOIN (or JOIN) means a record MUST have a match on each side; otherwise it is discarded.
A LEFT JOIN means give me all records in the LEFT table (ddHistorical in this case), regardless of a matching in the right-side table (ddCurrent). Not practical in this example.
A RIGHT JOIN is the reverse... give me all records from the RIGHT-side table REGARDLESS of a matching record in the left side table. Most of the time you will see LEFT-JOINs more frequently than RIGHT-JOINs.
Now, a sample to mentally get the left join. You work at a car dealership and have a master table of the 10 cars that you sell. For a given month, you want to know what IS NOT selling. So, start with the master table of all cars and look at the sales table for what DID sell. If there is NO such sales activity, the right-side table will have NULL values:
select
M.CarID,
M.CarModel
from
MasterCarsList M
LEFT JOIN CarSales CS
on M.CarID = CS.CarID
AND month( CS.DateSold ) = 4
where
CS.CarID IS NULL
So, my LEFT JOIN is based on a matching car ID AND on the month of sales activity being 4 (April), as I may not care about sales for Jan-Mar. In practice I would qualify the year too, but this is a simple sample.
If there is no record in the Car Sales table it will have a NULL value for all columns. I just happen to care about the car ID column since that was the join basis. That is why I am including that in the WHERE clause. For all other types of cars that DO have a sale it will have a value.
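Here is a small runnable sketch of that "what did not sell" query, using an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL. The car models and sale dates are made up, and because SQLite has no MONTH() function the April filter is written as a date range, which also pins down the year.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MasterCarsList (CarID INTEGER PRIMARY KEY, CarModel TEXT);
CREATE TABLE CarSales (CarID INTEGER, DateSold TEXT);
INSERT INTO MasterCarsList VALUES (1, 'Sedan'), (2, 'Coupe'), (3, 'Wagon');
INSERT INTO CarSales VALUES
  (1, '2023-04-10'), (2, '2023-03-05'), (1, '2023-04-22');
""")

# LEFT JOIN keeps every car; the IS NULL test keeps only cars
# for which no April sale was found.
unsold = conn.execute("""
SELECT M.CarID, M.CarModel
FROM MasterCarsList M
LEFT JOIN CarSales CS
  ON M.CarID = CS.CarID
 AND CS.DateSold >= '2023-04-01' AND CS.DateSold < '2023-05-01'
WHERE CS.CarID IS NULL
ORDER BY M.CarID
""").fetchall()

# The same question asked with NOT EXISTS returns the same rows.
unsold_not_exists = conn.execute("""
SELECT M.CarID, M.CarModel
FROM MasterCarsList M
WHERE NOT EXISTS (
  SELECT 1 FROM CarSales CS
  WHERE CS.CarID = M.CarID
    AND CS.DateSold >= '2023-04-01' AND CS.DateSold < '2023-05-01'
)
ORDER BY M.CarID
""").fetchall()

print(unsold)
```

The Coupe sold in March and the Wagon never sold, so both show up as not selling in April, while the Sedan is excluded.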
This is a common approach you will see in querying when someone wants every row from one table regardless of whether the other has a match. Some use WHERE NOT EXISTS ( subselect ) instead; on older MySQL versions that could perform slower because the subquery was tested for every record, while recent optimizers can run both forms as an anti-join.
Other examples may be you want a list of all employees of a company, and if they had some certification / training to show it... You still want all employees, but LEFT-JOINING to some certification/training table would expose those extra field as needed.
select
Emp.FullName,
Cert.DateCertified
FROM
Employees Emp
Left Join Certifications Cert
on Emp.EmpID = Cert.EmpID
Hopefully these samples help you understand better the relationship for queries, and now to actually provide answer for your needs.
If what you want is a list of all "Current" items and want to look at their historical past, I would use current FIRST. This might be if your current table of things is 50, but historically your table had 420 items. You don't care about the other 360 items, just those that are current and the history of those.
select
ddC.WhateverColumns,
ddH.WhateverHistoricalColumns
from
ddCurrent ddC
JOIN ddHistorical ddH
on ddC.ID = ddH.ID
If there is always a current field then a simple INNER JOIN will do it
SELECT a.CID, a.ID, a.Location, a.Status, a.Time, b.Status
FROM ddHistorical a
INNER JOIN ddCurrent b
ON a.ID = b.ID
An INNER JOIN will omit any ddHistorical rows that don't have a corresponding ID in ddCurrent.
A LEFT JOIN will include all ddHistorical rows, even if they don't have a corresponding ID in ddCurrent, but the ddCurrent values will be null (because they're unknown).
Also note that a LEFT JOIN is just a specific type of outer join. Don't bother with the others yet - 90% or more of what you'll ever do will be INNER or LEFT.
To include only those ddHistorical rows where the ID is in ddCurrent:
SELECT h.CID, h.ID, h.Location, h.Status, c.Status, h.Time
FROM ddHistorical h
INNER JOIN ddCurrent c ON h.ID = c.ID
If you want to include ddHistorical rows even if the ID isn't in ddCurrent:
SELECT h.CID, h.ID, h.Location, h.Status, c.Status, h.Time
FROM ddHistorical h
LEFT JOIN ddCurrent c ON h.ID = c.ID
If all ddHistorical rows happen to match an ID in ddCurrent, note that both queries will return the same result.
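The difference between the two joins shows up as soon as one historical ID has no current row. A minimal sketch, using an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, with made-up sample rows:

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ddCurrent (
  ID INTEGER PRIMARY KEY, Location TEXT, Status TEXT, Time TEXT
);
CREATE TABLE ddHistorical (
  CID INTEGER PRIMARY KEY AUTOINCREMENT,
  ID INTEGER, Location TEXT, Status TEXT, Time TEXT
);
INSERT INTO ddCurrent VALUES (1, 'North', 'UP', '2024-01-03');
-- ID 2 has history but no row in ddCurrent.
INSERT INTO ddHistorical (ID, Location, Status, Time) VALUES
  (1, 'North', 'DOWN', '2024-01-01'),
  (1, 'North', 'UP',   '2024-01-02'),
  (2, 'South', 'UP',   '2024-01-01');
""")

query = """
SELECT h.CID, h.ID, h.Status, c.Status
FROM ddHistorical h
{join} ddCurrent c ON h.ID = c.ID
ORDER BY h.CID
"""

inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # history rows for ID 2 are omitted
print(left_rows)   # history rows for ID 2 are kept, current Status is None
```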

Show earliest reservations where books are available

I have three tables: loans, reservations and books. The books table has an attribute for "noOfCopies", and the total number of loans and reservations for that book can not exceed that figure.
The reservations table has a "timestamp" column which is just the timestamp that the reservation was made. The idea is that when a book is returned, the earliest reservation gets the book next.
Here is what I need help with: I need to create an SQL view that will show all the earliest reservations for each book, but only where that book is available.
Can anyone give me the SQL for this? Thanks in advance.
Here is the SQL I already had: I thought it showed all the reservations where the books were available and was about to move on to figuring out how to show the earliest - but then I got nowhere with earliest and then realised that this doesn't actually work anyway:
CREATE VIEW `view_bookLoans` AS
SELECT count(l.`id`) as loanCount, l.`bookISBN`,b.`noOfCopies` FROM
loans l INNER JOIN books b
ON l.`bookISBN` = b.`ISBN`
GROUP BY l.`bookISBN`;
CREATE VIEW `view_reservationList` AS
SELECT
r.`timestamp`,
b.`title` as `bookTitle`,
r.`readerID`,
bl.`loanCount`
FROM
`reservations` r INNER JOIN `books` b
ON r.`bookISBN` = b.`ISBN`
LEFT JOIN
view_bookLoans bl
ON bl.`bookISBN` = b.`ISBN`
WHERE
(b.`noOfCopies` - bl.`loanCount`) > 0;
With exception to the comments I put in the question about how to exclude reservations already filled, or loans already returned... This SHOULD do it for you.
I would start the list by only looking at those books that people have reserved. Why query an entire list of books when nobody is interested in them? THEN, expand your criteria. The inner query starts directly with the books on the reservations list. That is LEFT JOINed to the loans table (in case there are none remaining on loan). I'm extracting the earliest timestamp for the reservation per book, and also getting the total count of loan entries grouped by ISBN.
From that result, I immediately re-join to the reservations to match based on the ISBN and timestamp of the earliest to get the WHO wanted it.
Now the finale... JOIN (not a LEFT JOIN) to the books table on both the ISBN AND your qualifier that the number of copies available... less how many are already loaned out is > 0.
You can obviously add an order by clause however you want.
ENSURE that you have an index on the reservations table on (bookISBN, timestamp) for query optimization. The loans table should also have an index on bookISBN.
SELECT
B.Title,
R2.readerID,
R2.timeStamp,
Wanted.AlreadyLoanedOut,
B.noOfCopies - Wanted.AlreadyLoanedOut as ShouldBeAvailable
FROM
( select
R.bookISBN,
MIN( R.timeStamp ) EarliestReservation,
COALESCE( COUNT( DISTINCT L.ID ), 0 ) as AlreadyLoanedOut -- DISTINCT, because each loan row repeats once per reservation of the same book
from
reservations R
LEFT JOIN loans L
ON R.bookISBN = L.bookISBN
group by
R.bookISBN ) as Wanted
JOIN reservations R2
ON Wanted.bookISBN = R2.bookISBN
AND Wanted.EarliestReservation = R2.timeStamp
JOIN books B
ON Wanted.bookISBN = B.ISBN
AND B.noOfCopies - Wanted.AlreadyLoanedOut > 0
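One thing to watch in this pattern: the inner query joins reservations to loans, so every loan row appears once per reservation of the same book. The sketch below uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, with made-up rows, and compares a plain COUNT with COUNT(DISTINCT ...) for the loan total. The helper name earliest_available is hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (ISBN TEXT PRIMARY KEY, title TEXT, noOfCopies INTEGER);
CREATE TABLE loans (id INTEGER PRIMARY KEY, bookISBN TEXT);
CREATE TABLE reservations (readerID INTEGER, bookISBN TEXT, timestamp TEXT);
-- Dune: 3 copies, 2 on loan, 2 reservations. Emma: 1 copy, 1 on loan.
INSERT INTO books VALUES ('A', 'Dune', 3), ('B', 'Emma', 1);
INSERT INTO loans (bookISBN) VALUES ('A'), ('A'), ('B');
INSERT INTO reservations VALUES
  (7, 'A', '2024-05-02 10:00'),
  (8, 'A', '2024-05-01 09:00'),
  (9, 'B', '2024-05-01 08:00');
""")

def earliest_available(count_expr):
    """Earliest reservation per book that still has a free copy.

    count_expr is the SQL expression used to count loans per book.
    """
    return conn.execute(f"""
    SELECT B.title, R2.readerID, R2.timestamp,
           B.noOfCopies - W.AlreadyLoanedOut
    FROM (SELECT R.bookISBN,
                 MIN(R.timestamp) AS EarliestReservation,
                 {count_expr} AS AlreadyLoanedOut
          FROM reservations R
          LEFT JOIN loans L ON R.bookISBN = L.bookISBN
          GROUP BY R.bookISBN) AS W
    JOIN reservations R2
      ON W.bookISBN = R2.bookISBN
     AND W.EarliestReservation = R2.timestamp
    JOIN books B
      ON W.bookISBN = B.ISBN
     AND B.noOfCopies - W.AlreadyLoanedOut > 0
    """).fetchall()

# Each loan counted once per book.
with_distinct = earliest_available("COUNT(DISTINCT L.id)")
# Each loan counted once per reservation row: Dune looks over-loaned.
with_plain_count = earliest_available("COUNT(L.id)")

print(with_distinct)
print(with_plain_count)
```

With two reservations and two loans for Dune, the plain count reports four loans against three copies and the book disappears from the result, while the distinct count leaves one copy free for reader 8.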

Select corresponding records from another table, but just the last one

I have 2 tables authors and authors_sales
The table authors_sales is updated each hour so is huge.
What I need is to create a ranking, for that I need to join both tables (authors has all the author data while authors_sales has just sales numbers)
How can I create a final table with the ranking of authors ordering it by sales?
The common key is the: authorId
I tried with LEFT JOIN but I must be doing something wrong because I get all the authors_sales table, not just the last.
Any tip in the right direction much appreciated
If you're looking for aggregate data of the sales, you'd want to join the tables, group by the authorId. Something like...
select authors.author_id, SUM(author_sales.sale_amt) as total_sales
from authors
inner join author_sales on author_sales.author_id = authors.author_id
group by authors.author_id
order by total_sales desc
However (I couldn't distinguish from your question whether the above scenario or next is true), if you're only looking for the max value of the author_sales table (if the data in this table is already aggregated), you can join on a nested query for author_sales, such as...
select authors.author_id, t.sale_amt
from authors
inner join author_sales t
on t.author_id = authors.author_id
inner join
(select author_id,
MAX(some_identifier) as latest_identifier
from author_sales
group by author_id) latest
on latest.author_id = t.author_id
and latest.latest_identifier = t.some_identifier
order by t.sale_amt desc
The some_identifier would be how you determine which record is the most recent for author_sales, whether it is a timestamp of when it was inserted or an incremental primary key, however it is set up. Depending on if the data in author_sales is aggregated already, one of these two should do it for you...
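As a rough illustration of the "latest row per author" case, the sketch below runs against an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL. The recorded_at column plays the role of some_identifier; it, the name column and the sample rows are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; recorded_at and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE author_sales (author_id INTEGER, sale_amt INTEGER, recorded_at TEXT);
INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bo');
-- Hourly snapshots; only the newest one per author should be ranked.
INSERT INTO author_sales VALUES
  (1, 100, '2024-01-01 10:00'), (1, 150, '2024-01-01 11:00'),
  (2, 300, '2024-01-01 10:00'), (2, 120, '2024-01-01 11:00');
""")

# The derived table finds each author's newest snapshot time;
# joining back on it keeps exactly that row per author.
ranking = conn.execute("""
SELECT a.name, s.sale_amt
FROM authors a
JOIN author_sales s ON s.author_id = a.author_id
JOIN (SELECT author_id, MAX(recorded_at) AS latest
      FROM author_sales
      GROUP BY author_id) m
  ON m.author_id = s.author_id
 AND m.latest = s.recorded_at
ORDER BY s.sale_amt DESC
""").fetchall()

print(ranking)
```

Bo's older snapshot of 300 is ignored, so Ann ranks first on the latest figures.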
select a.*, sum(b.sales)
from authors as a
inner join authors_sales as b
using (authorId)
group by b.authorId
order by sum(b.sales) desc;
/* assuming column sales = total for each row in authors_sales */