How to JOIN a table on an aggregate - MySQL

Issue Summary:
I need to join a single row to a table output based on an aggregate function, in this case the most recent corresponding record. Various other questions on this topic seem to work on the basis that values from both tables are required (INNER JOIN, etc.), but in my case the aggregate needs to work on a LEFT JOINed table that will often be NULL.
MySQL 5.7.
Here are a couple of illustrative tables:
Tables
Core_data:
-----------
create table `core`
(
core_id int(8) unsigned auto_increment primary key,
some_name varchar(80) null,
some_data varchar(80) null,
some_values .... etc.
)
Linked_data:
------------
create table `linked_data`
(
link_id smallint(6) unsigned auto_increment primary key,
core_id int(8) unsigned,
data_date date,
some_linked_data_values varchar(80) null
)
I have a query dealing with dozens of tables: selecting one row from the core table and various LEFT JOINed data from dozens of other tables.
The illustrated linked_data table holds dated rows, and the date is important: only the most recent row should be returned.
Example Data:
linked_data
------------
link_id | core_id | data_date | data_value | ...
-------------------------------------------------
1 | 2 | 2020-09-03 | something | ...
2 | 4 | 2019-07-29 | whatever | ...
3 | 1 | 2017-11-09 | yews | ...
4 | 4 | 2018-04-10 | socks | ...
I want to join only the row with core_id = 4 AND the maximum date value. How can I do this within the JOIN; I can't put the MAX aggregate into the JOIN ... ON condition.
My Current SQL:
My SQL is something like this:
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN linked_data ld ON core.core_id = ld.core_id AND MAX(ld.data_date)
WHERE ... core_id = value
For one table I need only the result row that has the highest value of a (date-based) column, and there is no guarantee that linked_data holds any data at all, so the LEFT JOIN may return NULL.
Expected result:
For core_id = 4 I want to be able to output a single SQL row result containing linked_data.data_value = 'whatever'. For core_id = 5 I want to be able to output the rest of the data but nothing from the linked_data table.
What Have I tried already?
This answer is noted as correct, but it is also noted that it will become very slow very quickly with larger amounts of data.
This answer puts the qualifier in the WHERE clause, but there's no promise that linked_data will contain any result at all. I could of course add further conditionals (e.g. NULL checks) into the WHERE clause, but I was hoping to avoid this.
This MySQL post has another possible solution, but comments on it also state it is very slow (that may be user error on their part; I've not tested it yet).
I have also tried using a SELECT in the LEFT JOIN like so:
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN (
SELECT linked_data FROM linked_data ldi WHERE core.core_id = ldi.core_id AND MAX(ldi.data_date)
) as ld ON core.core_id = ldi.core_id
WHERE ... core_id = value
Referenced from this Q&A
But this still tells me "Aggregate calls are not allowed here".
EDIT: I found why the aggregate wasn't allowed; it was a simple syntax mistake on my part. I have put up a full answer to clarify this Q&A, as I couldn't find any relevant answers when I was searching, so this may be useful to someone.
If anyone has a more correct way of solving the original issue please share!

In its shortest form, from what you provided, you can have a pre-query that becomes an aliased derived table for the join. This pre-query can group and take the max per core.
That said, since the linked_data table is auto-increment, I would assume (yes, I know about assuming, but you can confirm) that as each record is added, the date will always be the date it was added. So an ID of 100 may have a date of Jan 14, 2020, and you would never have an earlier-dated record with a higher ID, such as ID 101 = Nov 3, 2019. Each ID added has a later date than the previous record, regardless of the core_id.
You can then continue your additional "left-joins" to other tables as needed.
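A minimal sketch of that shortest form (an illustration only, not a drop-in answer), assuming as above that a higher link_id always means a more recent row, and using the table names from the question's DDL:
-- Pre-query: one row per core_id, picking the highest link_id (assumed to be the most recent)
SELECT cd.*, ld.data_date, ld.some_linked_data_values
FROM core cd
LEFT JOIN (
    SELECT core_id, MAX(link_id) AS link_id
    FROM linked_data
    GROUP BY core_id
) latest ON cd.core_id = latest.core_id
LEFT JOIN linked_data ld ON ld.link_id = latest.link_id
WHERE cd.core_id = ?;   -- plus your other LEFT JOINs as needed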
REVISION FROM COMMENT CLARIFICATION
Martin, from the clarification you provided that the data comes from multiple sources and the date could represent older data, just revise the pre-query inner SQL to the following. The query is heavily commented to clarify how the OVER/PARTITION query works in this scenario.
Now, integrating with all the rest of your stuff, I will only join to your primary core table as a LEFT JOIN:
select
cd.core_id,
cd.some_name,
cd.some_data,
cd.some_values,
ld2.link_id,
ld2.data_date,
ld2.some_linked_data_values
from
core_data cd
left join
( select pq1.*
from
-- first, all columns I want to have returned from
-- the linked_data table
( select core_id,
data_date,
link_id,
-- dense_rank() returns sequential counter value
-- starting at 1 based on every change of the
-- PARTITION BY Core_ID in next part below
dense_rank()
over ( partition by
-- primary sorting by the core_id
core_id
order by
-- then within each core_id, descending by date
data_date desc,
-- and then by the link_id descending, just in case
-- there are multiple records for the same core_id
-- AND the same date... So you get the most
-- recently added linked_data record for given core
link_id desc ) as sqlrow
from linked_data ) pq1
where
-- now, from inner partition/over query, only get the record
-- where the sqlrow = 1, as result of dense_rank()
-- that resets to 1 every time core_id changes
pq1.sqlrow=1 ) PQ
on cd.core_id = PQ.core_id
LEFT JOIN linked_data ld2
on PQ.Link_id = ld2.link_id
The inner query with OVER / PARTITION BY is basically making a first pass of the data, ordering it first by the partition (the core_id), then sub-sorted by data_date DESCENDING (so the most recent date comes first, regardless of whether the row was added earlier or later by the import from whatever external sources), then sub-sorted by link_id descending so the most recently added record wins for any given date.
The final outer WHERE clause is basically stating: only give me back the first row for every core_id. So now you have the critical elements to re-do the LEFT JOIN back to the original core_id, yet have the proper link_id to get the proper record in your final query result.
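As an aside, not part of the answer above: the question mentions MySQL 5.7, which does not have dense_rank() or other window functions (they arrived in MySQL 8.0). The same "latest row per core_id, tie-broken by link_id" pre-selection can be sketched without them, for example with a correlated subquery in the join condition; a rough illustration using the question's table names:
SELECT cd.*, ld2.link_id, ld2.data_date, ld2.some_linked_data_values
FROM core cd
LEFT JOIN linked_data ld2
    ON ld2.core_id = cd.core_id
    AND ld2.link_id = (
        -- pick the row with the greatest (data_date, link_id) for this core_id
        SELECT ldi.link_id
        FROM linked_data ldi
        WHERE ldi.core_id = cd.core_id
        ORDER BY ldi.data_date DESC, ldi.link_id DESC
        LIMIT 1
    )
WHERE cd.core_id = ?;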

After writing out this whole question, I found that a simple syntax fix to my latest example SQL resolved the correct way to do this (I've done it the correct way in the past but hadn't internalised it).
I note the lack of (or rather my inability to easily find) a clear definitive answer on the Interwebs to my query, so here's my methodology...
The correct way to do this type of aggregated join with respect to getting 0 or 1 results from the joined table is as follows:
LEFT JOIN and then wrap the subquery in brackets and alias it with AS ... as usual.
Put the aggregation function into the subquery, ignoring the join criteria; the join criteria belong in the outer query's ON ... section.
Use ORDER BY in the subquery and apply the aggregate at that point.
So;
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN (
SELECT ldi.* FROM linked_data ldi WHERE optional = 1 ORDER BY MAX(ldi.data_date)
) as ld ON core.core_id = ld.core_id
WHERE ... core_id = value
There is no need for a WHERE in the subquery, but if you do use one it should be static and must not reference the outer table; that relationship is handled in the ON clause. I put in WHERE optional = 1 for illustration.
The nature of the LEFT JOIN here is that it returns 0 or 1 rows, so I find there's no need for much else.
If anyone has a better way of solving my original issue please let me know!
P.S. I have not efficiency-tested this answer at all.

Related

MySQL view multiple join from other table, without several lookups

I'm trying to learn SQL better, views more specifically, but I can't get the following to work out for me.
I've put a slimmed-down version of it here. There are more joins I have to do based on foreign keys from the tbl2 matches.
Since it's a view, I can't create temp tables.
I can't rely on stored procedures in this case.
I could do OUTER APPLY, but only to get specific references (row 1, 2, ...), and that would mean doing a SELECT * FROM Table2 WHERE ... and therefore one index scan per time I use it.
I could create the view using "WITH tbl2 (FK_TABLE1...) AS (SELECT FK_TABLE1 FROM dbo.TABLE2)", but that doesn't seem to be helpful. Each reference to it does a sort or a scan, so no gain there.
Is there some way I'm able to create some type of list that I can reuse so I can simply just run 1 index scan to get the matching ones from Table2?
Or is there another way to think about this?
Table1 (PK, XX, YY)
Table2 (PK, FK_TABLE1, Type, Progress, ZZ, FK_Status)
Create View MyView
as
Select
Table1.PK
,Table1.XX
,Table1.YY
---- I want to present data from the first 3 matches
,(SELECT ZZ from tbl2 where tbl2.FK_TABLE1 = Table1.PK ORDER BY Type ASC OFFSET(0) ROWS FETCH NEXT (1) ROWS ONLY) ZZ1
,(SELECT ZZ from tbl2 where tbl2.FK_TABLE1 = Table1.PK ORDER BY Type ASC OFFSET(1) ROWS FETCH NEXT (1) ROWS ONLY) ZZ2
,(SELECT ZZ from tbl2 where tbl2.FK_TABLE1 = Table1.PK ORDER BY Type ASC OFFSET(2) ROWS FETCH NEXT (1) ROWS ONLY) ZZ3
,sts.StatusName CurrentStatus
From Table1
LEFT OUTER JOIN Table2 AS tbl2 ON (tbl2.FK_TABLE1= Table1.PK) ---- Here I want to make some sort of join so I get all matching rows from the other table
LEFT OUTER JOIN STATUS AS sts ON (sts.PK = [tbl2 ordered by type, if last elements status = X take that, else status of first).FK_STATUS) ---- Here I'm a bit puzzled, since I have to order by, but also have a fallback value if last element isn't matching.

MySQL Query for getting Payments & Invoices in one query

A customer has asked me to pull some extra accounting data for their statements. I have been having trouble writing a query that can accommodate this.
What they initially wanted was to get all "payments" from a certain date range that belong to a certain account, ordered oldest first, which was a simple statement as follows:
SELECT * FROM `payments`
WHERE `sales_invoice_id` = ?
ORDER BY `created_at` ASC;
However, now they want to have newly raised "invoices" as part of the statement. I cannot seem to write the correct query for this. UNION does not seem to work, and JOINs behave like, well, joins, so I cannot get a separate row for each payment/invoice; it instead merges them together. Ideally, I would retrieve a set of results as follows:
payment.amount | invoice.amount | created_at
-----------------------------------------------------------
NULL | 250.00 | 2014-02-28 8:00:00
120.00 | NULL | 2014-02-28 8:00:00
This way I can loop through these to generate a full statement. The latest query I tried was the following:
SELECT * FROM `payments`
LEFT OUTER JOIN `sales_invoices`
ON `payments`.`account_id` = `sales_invoices`.`account_id`
ORDER BY `created_at` ASC;
The first problem is that the "created_at" column is ambiguous, so I am not sure how to merge these. The second problem is that it merges rows and brings back duplicate rows, which is incorrect.
Am I thinking about this the wrong way or is there a feature of MySQL I am not aware of that can do this for me? Thanks.
You can use union all for this. You just need to define the columns carefully:
select ip.*
from ((select p.invoice_id, p.amount as payment, NULL as invoice, p.created_at
from payments p
) union all
(select i.invoice_id, NULL, i.amount, i.created_at
from sales_invoices i
)
) ip
order by created_at asc;
Your question is specifically about MySQL. Other databases support a type of join called the full outer join, which can also be used for this type of query (MySQL does not support full outer join).
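If the per-account and date-range filters from the original request are also needed, they can go inside each branch of the union. A sketch only: it assumes an account_id column on both tables (the question's own join used payments.account_id) and placeholder bounds:
select ip.*
from ((select p.invoice_id, p.amount as payment, NULL as invoice, p.created_at
       from payments p
       where p.account_id = ?                -- assumed column, per the question's join
         and p.created_at between ? and ?    -- placeholder date range
      ) union all
      (select i.invoice_id, NULL, i.amount, i.created_at
       from sales_invoices i
       where i.account_id = ?
         and i.created_at between ? and ?
      )
     ) ip
order by ip.created_at asc;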

MySQL - How do I optimize appending field from table b to a query of table a

I know this has to be a fairly common issue, and I am sure the answer is readily available, but I am not sure how to phrase my search, so I have been forced to troubleshoot this on my own for the most part.
Table A
id | content_id | score
1 | 2 | 16
2 | 2 | 4
3 | 3 | 8
4 | 3 | 12
Table B
id | content
1 | "Content Goes Here"
2 | "Content Goes Here"
3 | "Content Goes Here"
Objective: SUM all scores from table A, group by the unique content_id and show the content associated with the id, ordered by the sum score.
Current Working Query:
SELECT a.content_id, b.content, SUM(a.score) AS sum
FROM table_a a
LEFT JOIN table_b b ON a.content_id = b.id
GROUP BY a.content_id
ORDER BY sum ASC;
Problem: As far as I can tell, with the way I have structured my query, the content is grabbed from table_b by looping through each record in table_a, checking for a record in table_b with an identical id, and grabbing the content field. The problem here is that table_a has nearly 500k+ records and table_b has 112 records, which means that potentially 500,000 x 112 cross-table lookups/matches are being performed just to attach 112 unique content fields to a total of 112 rows in the final result set.
HELP!: How do I more efficiently append the 112 content fields from table_b to the 112 results produced by the query? I am guessing it has something to do with the query execution order, like somehow only looking up and appending the content field to the matched result rows AFTER the sums are produced and it is narrowed down to only 112 records? I have studied the MySQL API and benchmarked various subqueries, several joins, and even tried playing with UNION. It is probably something abundantly obvious to you guys, but my brain just can't get around it.
FYI: As mentioned earlier, the query does work. The results are produced in about 8 to 10 seconds, and of course each subsequent query after that is immediate because of query caching. But for me, with how simple this is, I know that 8 seconds can at LEAST be cut in half. I just feel it deep down in my guts. Right deep down in my gutssss.
I hope this is concise enough; if I need to clarify or explain something better, please let me know! Thanks in advance.
The MySQL query optimiser only allows "nested loop joins".** These are the internal operators for how an INNER JOIN is evaluated. Other RDBMSs allow other kinds of joins which are more efficient.
However, in your case you can try this. Hopefully the optimiser will do the aggregate before the JOIN:
SELECT
a.content_id, b.content, a.sum
FROM
(
SELECT content_id, SUM(score) AS sum
FROM table_a
GROUP BY content_id
) a
JOIN table_b b ON a.content_id = b.id
ORDER BY
sum ASC;
In addition, if you don't want the results ordered you can use ORDER BY NULL, which usually removes a filesort from the EXPLAIN. And of course, I assume that there are indexes on the two content_id columns (one primary key, one foreign key index).
Finally, I would also assume that an INNER JOIN will be enough: every a.content_id exists in table_b. If not, you are missing a foreign key and index on a.content_id.
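For illustration only, the assumed indexes might be created like this (index names are made up; table_b.id is presumably already the primary key):
-- Hypothetical DDL for the indexes assumed above; adjust names to your schema.
ALTER TABLE table_a ADD INDEX idx_table_a_content_id (content_id);
-- ALTER TABLE table_b ADD PRIMARY KEY (id);   -- only if it is not already defined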
** It's getting better but you need MariaDB or MySQL 5.6
This should be a little faster:
SELECT
tmp.content_id,
b.content,
tmp.asum
FROM (
SELECT
a.content_id,
SUM(a.score) AS asum
FROM
table_a a
GROUP BY
a.content_id
ORDER BY
NULL
) as tmp
LEFT JOIN table_b b
ON tmp.content_id = b.id
ORDER BY
tmp.asum ASC
You can use EXPLAIN to check the query execution plan for both queries when you want to benchmark them
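For example, prepending EXPLAIN to either variant shows its execution plan; a usage sketch of the query above:
EXPLAIN
SELECT tmp.content_id, b.content, tmp.asum
FROM (
    SELECT a.content_id, SUM(a.score) AS asum
    FROM table_a a
    GROUP BY a.content_id
    ORDER BY NULL
) AS tmp
LEFT JOIN table_b b ON tmp.content_id = b.id
ORDER BY tmp.asum ASC;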

MySQL Left Outer Join, Exclude Items in Second Table Belonging to User

I have two tables in my MySQL database: one is a library of all of the books in the database, and the other contains individual rows corresponding to which books are in a user's library.
For example:
Library Table
`id` `title`...
===== ===========
1 Moby Dick
2 Harry Potter
Collection Table
`id` `user` `book`
===== ====== =======
1 1 2
2 2 2
3 1 1
What I want to do is run a query that will show all the books that are not in a user's collection. I can run this query to show all the books not in any user's collection:
SELECT *
FROM `library`
LEFT OUTER JOIN `collection` ON `library`.`id` = `collection`.`book`
WHERE `collection`.`book` IS NULL
This works just fine as far as I can tell. Running this in PHPMyAdmin will result in all of the books that aren't in the collection table.
However, how do I restrict that to a certain user? For example, with the above dummy data, I want book 1 to result if user 2 runs the query, and no books if user 1 runs the query.
Just adding an AND user=[id] doesn't work, and with my extremely limited knowledge of JOIN statements I'm not getting anywhere really.
Also, the id of the results being returned (for the query shown, which doesn't do what I want but does function) is 0 -- how do I make sure the id returned is that of library.id?
You'll have to narrow down your LEFT JOIN selection to only the books that a particular user has; then whatever is NULL in the joined table will be the rows (books) that the user does not have in his/her collection:
SELECT
a.id,
a.title
FROM
library a
LEFT JOIN
(
SELECT book
FROM collection
WHERE user = <userid>
) b ON a.id = b.book
WHERE
b.book IS NULL
An alternative is:
SELECT
a.id,
a.title
FROM
library a
WHERE
a.id NOT IN
(
SELECT book
FROM collection
WHERE user = <userid>
)
However, the first solution is preferable, as with the second MySQL will execute the NOT IN subquery once for each row rather than just once for the whole query. Intuitively, you would expect MySQL to execute the subquery once and use it as a list, but MySQL is not smart enough to distinguish between correlated and non-correlated subqueries.
As stated here:
"The problem is that, for a statement that uses an IN subquery, the
optimizer rewrites it as a correlated subquery."
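Not from either answer here, but for comparison, the same anti-join is often written with NOT EXISTS; a sketch using the question's tables, with <userid> as a placeholder as above:
SELECT l.id, l.title
FROM library l
WHERE NOT EXISTS (
    SELECT 1
    FROM collection c
    WHERE c.user = <userid>
      AND c.book = l.id
);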
How about this? It's just off the top of my head - I don't have access to a database to test on right now. (sorry)
SELECT
*
FROM
library lib
WHERE
lib.id NOT IN (
SELECT
book
FROM
collection coll
WHERE
coll.user =[id]
)
;

MySQL JOIN tables with WHERE clause

I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but I'll admit, I am no MySQL veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that you can't use subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you as a relative newbie in SQL. Use "aliases" for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It makes your life easier, and helps anyone AFTER you to know which columns come from which table, especially if the same column name exists in different tables; it prevents ambiguity in the query. Your left join, I think, from the sample, may be ambiguous, but can you confirm the join of B.ID to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry would have the "Team_ID" that the posting was from, in addition to its OWN "ID" as the blabbing table's unique key.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, use SQL "OR", not the logical "||" operator.
EDIT -- per your comment
This could be done in a variety of ways, such as building and executing dynamic SQL, calling the query multiple times (once for each ID) and merging the results, or by joining to yet another temp table that gets cleaned out, say, daily.
If you have another table such as "TeamJoins" with, say, 3 columns -- a date, a session ID and a team_id -- you could purge anything older than a day, and/or keep clearing rows each time a new query comes in from the same session ID (as it appears to come from PHP). Have two indexes: one on the date (to simplify any daily purging), and a second on (sessionID, team_id) for the join.
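A possible shape for that helper table, based on the three columns and two indexes described above (names and types are illustrative assumptions, not from the original answer):
CREATE TABLE TeamJoins (
    query_date  DATE         NOT NULL,   -- used for the daily purge
    session_id  VARCHAR(64)  NOT NULL,   -- the PHP session identifier
    team_id     INT UNSIGNED NOT NULL,
    INDEX idx_teamjoins_date (query_date),                    -- simplifies daily purging
    INDEX idx_teamjoins_session_team (session_id, team_id)    -- supports the join
);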
Then, loop through and insert into the "TeamJoins" table the elements identified.
THEN, instead of a hard-coded IN list, you could change that part to:
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TJ.Team_ID IS NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is:
I added an extra column to my blabbing table called team_id and set it to NULL, as well as another field in my team_blabbing table called mem_id.
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thoughts on what I did. Try not to be too harsh though :) Thanks again for all the info.