I'm attempting to take an existing application and re-architect the schema to support new customer requests and fix several outstanding issues (mostly around our current schema being heavily denormalized). In doing so, I've reached an interesting problem which at first glance seems to have a simple solution, but I can't seem to find the function I'm looking for.
The application is a media organization tool.
Our Old Schema:
Our old schema had separate models for "Groups", "Subgroups", and "Videos". A Group could have many Subgroups (one-to-many) and a Subgroup could have many Videos (one-to-many).
There were certain fields that were shared among Groups, Subgroups, and Videos. For instance, the Google Analytics ID to be used when the Video was embedded on a page. Whenever we displayed the embed page we would first look if the value was set on the Video. If not, we checked its Subgroup. If not, we checked its Group. The query looked roughly like so (I wish this were the real query, but unfortunately our application was written over many years by many junior developers, so the truth is much more painful):
SELECT
v.id,
COALESCE(v.google_analytics_id, sg.google_analytics_id, g.google_analytics_id) as google_analytics_id
FROM
Videos v
LEFT JOIN Subgroups sg ON sg.id = v.subgroup_id
LEFT JOIN Groups g ON g.id = sg.group_id
Pretty straightforward. The issue we've run into is that customers want to be able to nest groups arbitrarily deep, and our schema clearly only allows for 2 levels (and, in fact, requires two levels, even if you only want one).
New Schema (First Pass):
As a first pass, I knew we'd want a basic tree structure for the Groups, so I came up with this:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
parent_id INT,
ga_id VARCHAR(20)
)
We can then easily nest up to N levels deep with N joins like so:
SELECT
v.id,
COALESCE(v.ga_id, g1.ga_id, g2.ga_id, g3.ga_id, ...) as ga_id
FROM
Videos v
LEFT JOIN Groups g1 ON g1.id = v.group_id
LEFT JOIN Groups g2 ON g2.id = g1.parent_id
LEFT JOIN Groups g3 ON g3.id = g2.parent_id
...
There are obvious flaws with this approach: we don't know how many parents there will be, so we don't know how many times we should JOIN, forcing us to implement a "max depth". Even with a max depth, if a person only has a single level of groups we still perform multiple JOINs, because our queries can't know how deep they need to go. MySQL offers recursive queries, but while looking into whether that was the right option I found a smarter schema that produced the same results.
New Schema (Take 2):
Looking into better ways to handle a tree structure, I learned about Adjacency Lists (my prior solution), Nested Sets, Materialized Paths, and Closure Tables. Adjacency Lists depend on JOINs to grab the entire tree structure, and so produce a single row with multiple columns per node on the tree. The other three solutions all return multiple rows for each node on the tree.
I ended up going with a Closure Table solution like so:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
ga_id VARCHAR(20)
);
CREATE TABLE Group_Closure (
ancestor_id INT,
descendant_id INT,
PRIMARY KEY (ancestor_id, descendant_id)
);
Now given a Video I can get all of its parents like so:
SELECT
v.id,
v.ga_id,
g.id,
g.ga_id
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant_id
JOIN Groups g ON g.id = gc.ancestor_id;
This returns each group in the hierarchy as a separate row:
+------+---------+------+---------+
| v.id | v.ga_id | g.id | g.ga_id |
+------+---------+------+---------+
| 1 | abc123 | 2 | new_val |
| 1 | abc123 | 1 | default |
| 2 | NULL | 4 | xyz987 |
| 2 | NULL | 3 | NULL |
| 2 | NULL | 1 | default |
| 3 | NULL | 3 | NULL |
| 3 | NULL | 1 | default |
+------+---------+------+---------+
What I wish to do now is somehow achieve the same result I would have expected from using COALESCE on multiple self-joined Group tables: a single value for ga_id based on whichever node is "lowest" in the tree.
Because I have multiple rows per Video, I suspect that this can be accomplished using GROUP BY and some kind of aggregate function:
SELECT
v.id,
COALESCE(v.ga_id, FIRST_NON_NULL(g.ga_id))
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant_id
JOIN Groups g ON g.id = gc.ancestor_id
GROUP BY v.id, v.ga_id;
Note that because (ancestor_id, descendant_id) is my primary key, I believe the order of the group closure table can be guaranteed to always come back the same, meaning if I put the lowest node first, it will be the first row in the resulting query. If my understanding of this is incorrect, please let me know.
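One possible way to get the effect of the hypothetical FIRST_NON_NULL above is a correlated subquery that picks the nearest ancestor with a non-NULL ga_id. This is only a sketch under some loud assumptions: it adds a depth column to the closure table (not part of the schema above, 0 meaning the group itself), it renames Groups to video_groups for the demo, and it runs on SQLite through Python's sqlite3 module rather than on MySQL. The sample rows are reconstructed from the result table above. It does not rely on any row order coming back from the closure table.

```python
import sqlite3

closure_db = sqlite3.connect(":memory:")
closure_db.executescript("""
CREATE TABLE video_groups (id INTEGER PRIMARY KEY, name TEXT, ga_id TEXT);
CREATE TABLE group_closure (
    ancestor_id   INTEGER,
    descendant_id INTEGER,
    depth         INTEGER,  -- assumed extra column: 0 = the group itself
    PRIMARY KEY (ancestor_id, descendant_id)
);
CREATE TABLE videos (id INTEGER PRIMARY KEY, group_id INTEGER, ga_id TEXT);

-- Sample data reconstructed from the question: 1 is the root,
-- 2 and 3 are children of 1, and 4 is a child of 3.
INSERT INTO video_groups VALUES
    (1, 'root', 'default'), (2, 'a', 'new_val'), (3, 'b', NULL), (4, 'c', 'xyz987');
INSERT INTO group_closure VALUES
    (1, 1, 0), (2, 2, 0), (1, 2, 1), (3, 3, 0), (1, 3, 1),
    (4, 4, 0), (3, 4, 1), (1, 4, 2);
INSERT INTO videos VALUES (1, 2, 'abc123'), (2, 4, NULL), (3, 3, NULL);
""")


def effective_ga_ids(db):
    """Return (video id, ga_id) using the video's own value or the nearest ancestor's."""
    return db.execute("""
        SELECT v.id,
               COALESCE(v.ga_id,
                        (SELECT g.ga_id
                         FROM group_closure gc
                         JOIN video_groups g ON g.id = gc.ancestor_id
                         WHERE gc.descendant_id = v.group_id
                           AND g.ga_id IS NOT NULL
                         ORDER BY gc.depth ASC   -- nearest ancestor first
                         LIMIT 1)) AS ga_id
        FROM videos v
        ORDER BY v.id
    """).fetchall()


print(effective_ga_ids(closure_db))
# [(1, 'abc123'), (2, 'xyz987'), (3, 'default')]
```

The explicit ORDER BY on depth is the point of the extra column: it makes "lowest in the tree" something the query states rather than something inferred from storage order.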
If you were to stick with an adjacency list, you could use a recursive CTE. This one traverses up from each video id value until it finds a non-NULL ga_id:
WITH RECURSIVE CTE AS (
SELECT id, ga_id, group_id
FROM videos
UNION ALL
SELECT CTE.id, COALESCE(CTE.ga_id, g.ga_id), g.parent_id
FROM `groups` g
JOIN CTE ON g.id = CTE.group_id AND CTE.ga_id IS NULL
)
SELECT id, ga_id
FROM CTE
WHERE ga_id IS NOT NULL
For my attempt to reconstruct your data from your question, this yields:
id ga_id
1 abc123
2 xyz987
3 default
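If you want to try the recursive CTE without a MySQL server, here is a minimal sketch that runs the same query shape on SQLite via Python's sqlite3 module. The table name video_groups (instead of `groups`) and the sample rows are assumptions reconstructed from the question, and SQLite is only a stand-in for MySQL 8.0.

```python
import sqlite3

adjacency_db = sqlite3.connect(":memory:")
adjacency_db.executescript("""
CREATE TABLE video_groups (
    id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, ga_id TEXT
);
CREATE TABLE videos (id INTEGER PRIMARY KEY, group_id INTEGER, ga_id TEXT);

-- Assumed sample data: group 1 is the root, 2 and 3 are its children, 4 is under 3.
INSERT INTO video_groups VALUES
    (1, 'root', NULL, 'default'), (2, 'a', 1, 'new_val'),
    (3, 'b', 1, NULL), (4, 'c', 3, 'xyz987');
INSERT INTO videos VALUES (1, 2, 'abc123'), (2, 4, NULL), (3, 3, NULL);
""")


def resolve_ga_ids(db):
    """Walk up from each video until a non-NULL ga_id is found."""
    return db.execute("""
        WITH RECURSIVE cte(id, ga_id, group_id) AS (
            SELECT id, ga_id, group_id
            FROM videos
            UNION ALL
            SELECT cte.id, COALESCE(cte.ga_id, g.ga_id), g.parent_id
            FROM video_groups g
            JOIN cte ON g.id = cte.group_id AND cte.ga_id IS NULL
        )
        SELECT id, ga_id
        FROM cte
        WHERE ga_id IS NOT NULL
        ORDER BY id
    """).fetchall()


print(resolve_ga_ids(adjacency_db))
# [(1, 'abc123'), (2, 'xyz987'), (3, 'default')]
```

Note that a video whose whole ancestor chain has NULL ga_id values would drop out of the result entirely, because of the final WHERE clause.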
I have inherited a table with information about some groups of people, in which one field contains delimited data, with the values matched to another table.
id_group Name
-----------------------
1 2|4|5
2 3|4|6
3 1|2
And in another table I have a list of people who may belong to one or more groups
id_names Names
-----------------------
1 Jack
2 Joe
3 Fred
4 Mary
5 Bill
I would like to perform a select on the group data which results in a single field containing a comma- or space-delimited list of names, such as this for the first group row above: "Joe Mary Bill".
I have looked at using a function to split the delimited string, and also looked at sub queries, but concatenating the results of sub queries quickly becomes huge.
Thanks!
As implied by Strawberry's comment above, there is a way to do this, but it's so ugly. It's like finishing your expensive kitchen remodel using duct tape. You should feel resentment toward the person who designed the database this way.
SELECT g.id_group, GROUP_CONCAT(n.Names SEPARATOR ' ') AS Names
FROM `groups` AS g JOIN names AS n
ON FIND_IN_SET(n.id_names, REPLACE(g.Name, '|', ','))
GROUP BY g.id_group;
Output, tested on MySQL 5.6:
+----------+---------------+
| id_group | Names |
+----------+---------------+
| 1 | Joe Mary Bill |
| 2 | Fred Mary |
| 3 | Jack Joe |
+----------+---------------+
The complexity of this query, and the fact that it will be forced to do a table-scan and cannot be optimized, should convince you of what is wrong with storing a list of IDs in a delimited string.
The better solution is to create a third table, in which you store each individual member of the group on a row by itself. That is, multiple rows per group.
CREATE TABLE group_name (
id_group INT NOT NULL,
id_name INT NOT NULL,
PRIMARY KEY (id_group, id_name)
);
Then you can query in a simpler way, and you have an opportunity to create indexes to make the query very fast.
SELECT id_group, GROUP_CONCAT(names.Names SEPARATOR ' ') AS names
FROM `groups`
JOIN group_name USING (id_group)
JOIN names ON names.id_names = group_name.id_name
GROUP BY id_group;
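To make the third-table idea concrete, here is a small sketch that builds group_name from the pipe-delimited rows in the question and aggregates the names per group. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, so it uses SQLite's group_concat(expr, separator) form instead of MySQL's SEPARATOR keyword. The names are sorted in Python only because the concatenation order is not specified here.

```python
import sqlite3

membership_db = sqlite3.connect(":memory:")
membership_db.executescript("""
CREATE TABLE names (id_names INTEGER PRIMARY KEY, Names TEXT);
CREATE TABLE group_name (
    id_group INTEGER NOT NULL,
    id_name  INTEGER NOT NULL,
    PRIMARY KEY (id_group, id_name)
);
INSERT INTO names VALUES (1, 'Jack'), (2, 'Joe'), (3, 'Fred'), (4, 'Mary'), (5, 'Bill');
-- One row per membership, split out of '2|4|5', '3|4|6' and '1|2'.
-- Id 6 has no matching person, so the inner join drops it.
INSERT INTO group_name VALUES
    (1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (2, 6), (3, 1), (3, 2);
""")


def members_by_group(db):
    """Return {id_group: sorted list of member names}."""
    rows = db.execute("""
        SELECT gn.id_group, group_concat(n.Names, ' ')
        FROM group_name AS gn
        JOIN names AS n ON n.id_names = gn.id_name
        GROUP BY gn.id_group
        ORDER BY gn.id_group
    """).fetchall()
    return {group_id: sorted(joined.split()) for group_id, joined in rows}


print(members_by_group(membership_db))
# {1: ['Bill', 'Joe', 'Mary'], 2: ['Fred', 'Mary'], 3: ['Jack', 'Joe']}
```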
Shadow is correct. Your primary problem is the bad design of relations in the database. Typically one designs this kind of business problem as a so-called M:N relation (M to N). To accomplish that you need 3 tables:
The first table is groups, which has a GroupId field with a primary key on it and a readable name field (e.g. 'group1' or whatever).
The second table is people, which looks exactly as you showed above. (Do not forget to put a primary key on the PeopleId field here as well.)
The third table is a bridge table called GroupMemberships. It has 2 fields, GroupId and PeopleId. This table connects the first two with each other and represents the M:N relation. One group can have 1 to N members and people can be members of 1 to M groups.
Finally, just join together the tables in the select and aggregate:
SELECT
g.Name,
GROUP_CONCAT(p.Name ORDER BY p.PeopleId DESC SEPARATOR ';') AS Members
FROM
Groups AS g
INNER JOIN GroupMemberships AS gm ON g.GroupId = gm.GroupId
INNER JOIN people AS p ON gm.PeopleId = p.PeopleId
GROUP BY g.Name;
I'm facing the following situation and have not found a good solution to this problem. I am optimizing an API, so I am looking for the fastest possible solution.
The following description is not exactly what I am doing, but I think it represents the problem well.
Let's say I have a table of products:
+----+----------+
| id | name |
+----+----------+
| 1 | product1 |
| 2 | product2 |
+----+----------+
And I have a table of attachments for each product, separated by language:
+----+----------+------------+-----------------------+
| id | language | product_id | attachment_url |
+----+----------+------------+-----------------------+
| 1 | bb | 1 | image1_bb.jpg |
| 1 | en | 1 | image1_en.jpg |
| 1 | pt | 1 | image1_pt.jpg |
| 2 | bb | 1 | image2_bb.jpg |
| 2 | pt | 1 | image2_pt.jpg |
+----+----------+------------+-----------------------+
What I intend to do is get the correct attachment according to the language selected on the request. As you can see above, I can have several attachments for each product. We use Babel (bb) as a generic language, so whenever I don't have an attachment in the right language, I should get the Babel version. It is also important to note that the primary key of the attachments table is a composite of id + language.
So, supposing I try to get all the data in pt, my first option to create a SQL query was:
SELECT p.id, p.name,
GROUP_CONCAT( '{',a.id,',',a.attachment_url, '}' ) as attachments_list
FROM products p
LEFT JOIN attachments a
ON (a.product_id=p.id AND (a.language='pt' OR a.language='bb'))
The problem is that, with this query I always get the bb data and I only want to get it when there is no attachment on the right language.
I already tried to do a subquery changing attachments for:
(SELECT * FROM attachments GROUP BY id ORDER BY id ASC, language DESC)
but it doubles the time of the request.
I also tried using DISTINCT inside the GROUP_CONCAT, but it only works if the whole result of each row is equal, so it does not work for me.
Does anyone know any other solution that I can apply directly in the query?
EDIT:
Combining the answers of @Vulcronos and @Barmar made the final solution at least 2x faster than the one I first suggested.
Just to add some context, for anybody else who is looking for it: I am using Phalcon. Because of that, I had a lot of trouble putting the pieces together, as Phalcon PHQL does not support subqueries, nor a lot of the other things I had to use.
For my scenario, where I had to deliver approximately 1.2MB of JSON content, with more than 2100 objects, using custom queries made the total request time up to 3x faster than Phalcon's native relation management methods (hasMany(), hasManyToMany(), etc.) and 10x faster than my original solution (which made heavy use of the find() method).
Try doing two joins instead of one:
SELECT p.id, p.name,
GROUP_CONCAT( '{',COALESCE(a.id, b.id),',',COALESCE(a.attachment_url, b.attachment_url), '}' ) as attachments_list
FROM products p
LEFT JOIN attachments a
ON (a.product_id=p.id AND a.language='pt')
LEFT JOIN attachments b
ON (b.product_id=p.id AND b.language='bb')
and then using COALESCE to return b instead of a if a doesn't exist. You can also do it with a subselect if the above doesn't work.
OR conditions tend to make queries slow, because it's hard to optimize them with indexes. Try joining separately using the two different languages.
SELECT p.id, p.name,
IFNULL(apt.attachment_url, abb.attachment_url) AS attachment_url
FROM products AS p
JOIN attachments AS abb ON abb.product_id = p.id
LEFT JOIN attachments AS apt ON apt.product_id = p.id AND apt.language = 'pt'
WHERE abb.language = 'bb'
This assumes that all products have a bb attachment, while pt is optional.
I left out the join of Product, because it's not relevant for this problem. It's only needed to include the product name in the resultset.
SELECT a.product_id, a.id, a.attachment_url FROM attachments a
WHERE a.language = ?
OR (a.language = 'bb'
AND NOT EXISTS
(SELECT * FROM attachments
WHERE language = ?
AND id = a.id
AND product_id = a.product_id));
Notes: problems like this usually have many possible solutions. This is not necessarily the most efficient one.
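To check the fallback behaviour of the NOT EXISTS query, here is a small sketch that loads the attachment rows from the question and runs it for a requested language. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL; the function name attachments_for and the inner alias x are made up for this example.

```python
import sqlite3

attachments_db = sqlite3.connect(":memory:")
attachments_db.executescript("""
CREATE TABLE attachments (
    id             INTEGER NOT NULL,
    language       TEXT    NOT NULL,
    product_id     INTEGER NOT NULL,
    attachment_url TEXT,
    PRIMARY KEY (id, language)
);
INSERT INTO attachments VALUES
    (1, 'bb', 1, 'image1_bb.jpg'),
    (1, 'en', 1, 'image1_en.jpg'),
    (1, 'pt', 1, 'image1_pt.jpg'),
    (2, 'bb', 1, 'image2_bb.jpg'),
    (2, 'pt', 1, 'image2_pt.jpg');
""")


def attachments_for(db, language):
    """Return the requested language per attachment, falling back to 'bb'."""
    return db.execute("""
        SELECT a.product_id, a.id, a.attachment_url
        FROM attachments a
        WHERE a.language = ?
           OR (a.language = 'bb'
               AND NOT EXISTS (SELECT 1
                               FROM attachments x
                               WHERE x.language = ?
                                 AND x.id = a.id
                                 AND x.product_id = a.product_id))
        ORDER BY a.product_id, a.id
    """, (language, language)).fetchall()


print(attachments_for(attachments_db, "en"))
# [(1, 1, 'image1_en.jpg'), (1, 2, 'image2_bb.jpg')]
print(attachments_for(attachments_db, "pt"))
# [(1, 1, 'image1_pt.jpg'), (1, 2, 'image2_pt.jpg')]
```

Attachment 2 has no 'en' row, so the 'bb' version is returned for it, while attachment 1 keeps its 'en' version.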
I have a remarks table which can be linked to any number of other items in a system, in the case of this example we'll use bookings, enquiries and referrals.
Thus in the remarks table we have columns
remark_id | datetime | text | booking_id | enquiry_id | referral_id
1 | 2014-06-28 | abc | 0 | 8 | 0
2 | 2014-06-27 | def | 3 | 0 | 0
3 | 2014-05-31 | ghi | 0 | 0 | 10
Etc...
Each of the item tables will have a field called name. Thus when I want to select a remark the likelihood is I'll need this name.
I'd like to achieve this with a single query, getting a 2d array as follows:
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'name'=>'Harold']
However the query I'd expect to use would be
SELECT r.remark_id,r.datetime,r.text
,b.name AS book,rr.name AS referral,e.name AS enquiry
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Leaving me with the output
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'book'=>'Harold', 'referral'=>'', 'enquiry'=>'']
And more processing to do before or during rendering it to a view.
Is there a way to write a query such that it would fill a field from the first NOT NULL string it encountered in one of the joined tables?
Please only suggest using a different database system if you know that MySQL doesn't provide any way to do what I'm asking. If it's the case it can't be done there's no business sense in rewriting the system anyway, but I'd like to ask!
Two ways I can think of:
use UNION:
SELECT remark_id, datetime, text, name
FROM remarks
JOIN bookings ON (remarks.book_id = bookings.book_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN referrals ON (remarks.referral_id = referrals.referral_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN enquiries ON (remarks.enquiry_id = enquiries.enquiry_id)
use IFNULL (probably much slower):
SELECT r.remark_id,r.datetime,r.text,
IFNULL(b.name,IFNULL(rr.name,e.name)) AS name
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Variant 2 is really much slower because of the LEFT JOINs.
Also, generally I would not recommend using 0 as value for non-existent links, rather use NULL. This will allow MySQL to speed up the join.
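As a rough illustration of the LEFT JOIN variant combined with the NULL advice, here is a sketch that stores NULL for missing links and picks the first non-NULL name with COALESCE, which behaves like the nested IFNULL calls. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL. The key column booking_id follows the remarks table in the question (the queries call it book_id), and the names Maude and Edith are invented sample data.

```python
import sqlite3

remarks_db = sqlite3.connect(":memory:")
remarks_db.executescript("""
CREATE TABLE bookings  (booking_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE enquiries (enquiry_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE referrals (referral_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE remarks (
    remark_id   INTEGER PRIMARY KEY,
    datetime    TEXT,
    text        TEXT,
    booking_id  INTEGER,  -- NULL (not 0) when the remark has no booking
    enquiry_id  INTEGER,
    referral_id INTEGER
);
INSERT INTO bookings  VALUES (3,  'Maude');   -- invented sample name
INSERT INTO enquiries VALUES (8,  'Harold');
INSERT INTO referrals VALUES (10, 'Edith');   -- invented sample name
INSERT INTO remarks VALUES
    (1, '2014-06-28', 'abc', NULL, 8,    NULL),
    (2, '2014-06-27', 'def', 3,    NULL, NULL),
    (3, '2014-05-31', 'ghi', NULL, NULL, 10);
""")


def remarks_with_name(db):
    """Return each remark with the name of whichever item it links to."""
    return db.execute("""
        SELECT r.remark_id, r.datetime, r.text,
               COALESCE(b.name, rr.name, e.name) AS name
        FROM remarks AS r
        LEFT JOIN bookings  AS b  ON b.booking_id   = r.booking_id
        LEFT JOIN referrals AS rr ON rr.referral_id = r.referral_id
        LEFT JOIN enquiries AS e  ON e.enquiry_id   = r.enquiry_id
        ORDER BY r.remark_id
    """).fetchall()


for row in remarks_with_name(remarks_db):
    print(row)
# (1, '2014-06-28', 'abc', 'Harold')
# (2, '2014-06-27', 'def', 'Maude')
# (3, '2014-05-31', 'ghi', 'Edith')
```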
One way to achieve this is with nested IF expressions:
if(b.name is not null, b.name, if(rr.name is not null, rr.name, e.name)) as name
One drawback is that this gives an implicit priority to bookings; not sure if that would be an issue.
Perhaps the main drawback, though, is that this is kind of "magical" and has goofy syntax, so it might be clearer to just handle those cases in the controller after all.
Seems quite messy that you have multiple unused columns for each entry, unless I'm not understanding correctly. If you add more tables, you'd have to adjust each of the views so that it would filter out the new table.
I'd be tempted to redesign your structure so that each of the tables has a remarkgroup_id column, then add the following remark table
remark_id, remarkgroup_id, date, message
This would clean up the extra unused columns and allow you to use simple joining logic.
I have a little SQL query, but I can't find a way to get text back, just numbers. - revised!
SELECT if( `linktype` = "group",
(SELECT contactgroups.grname
FROM contactgroups, groupmembers
WHERE contactgroups.id = groupmembers.id ???
AND contactgroups.id = groupmembers.link_id),
(SELECT contactmain.contact_sur
FROM contactmain, groupmembers
WHERE contactmain.id = groupmembers.id ???
AND contactmain.id = groupmembers.link_id) ) AS adat
FROM groupmembers;
Now that I have improved it a bit, it gives back some info, but the lines marked ??? (thanks to minitech) show where my problem is. I can't see how I could fix it... Any advice is welcome! Thanks
Contactmain (id, contact_sur, email2)
data:
1 | Peter | email@email.com
2 | Andrew| email2@email.com
Contactgroups (id, grname)
data:
1 | All
2 | Trustee
3 | Comitee
Groupmembers (id, group_id, linktype, link_id)
data:
1 | 1 | contact | 1
2 | 1 | contact | 2
3 | 2 | contact | 1
4 | 3 | group | 2
I would like to list who is in 'Comitee'. The result should be Andrew and Trustee, if I am right :)
The join does look a bit redundant, since you are implying that the ID and Link_ID columns hold the same value. Because BOTH select values are derived from a qualification against the group members table, I have restructured the query to use THAT as the primary table and do a LEFT JOIN to each of the other tables, anticipating from your query that the link should be found in ONE or the OTHER table. With the two LEFT JOINs, you go through the GroupMembers table only ONCE. Now, your IF(): since group members is the basis, and we have BOTH tables available and linked, we just grab the column from one table or the other. I've included "linktype" too, just for reference. Using STRAIGHT_JOIN keeps the engine from trying to change the order in which the tables are joined.
SELECT STRAIGHT_JOIN
gm.linktype,
if( gm.linktype = "group", cg.grname, cm.contact_sur ) ADat
from
groupmembers gm
left join contactgroups cg
ON gm.link_id = cg.id
left join contactmain cm
ON gm.link_id = cm.id
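Here is a small sketch of that restructured query against the sample data from the question, run on SQLite through Python's sqlite3 module as a stand-in for MySQL. SQLite has no IF() function or STRAIGHT_JOIN, so the sketch uses a CASE expression (which MySQL also accepts). The group_id filter and the function name member_labels are additions for this example, not part of the query above.

```python
import sqlite3

contacts_db = sqlite3.connect(":memory:")
contacts_db.executescript("""
CREATE TABLE contactmain   (id INTEGER PRIMARY KEY, contact_sur TEXT, email2 TEXT);
CREATE TABLE contactgroups (id INTEGER PRIMARY KEY, grname TEXT);
CREATE TABLE groupmembers (
    id INTEGER PRIMARY KEY, group_id INTEGER, linktype TEXT, link_id INTEGER
);
INSERT INTO contactmain VALUES
    (1, 'Peter', 'email@email.com'), (2, 'Andrew', 'email2@email.com');
INSERT INTO contactgroups VALUES (1, 'All'), (2, 'Trustee'), (3, 'Comitee');
INSERT INTO groupmembers VALUES
    (1, 1, 'contact', 1), (2, 1, 'contact', 2),
    (3, 2, 'contact', 1), (4, 3, 'group', 2);
""")


def member_labels(db, group_id):
    """Return (linktype, label) for every member row of one group."""
    return db.execute("""
        SELECT gm.linktype,
               CASE WHEN gm.linktype = 'group'
                    THEN cg.grname
                    ELSE cm.contact_sur
               END AS adat
        FROM groupmembers gm
        LEFT JOIN contactgroups cg ON gm.link_id = cg.id
        LEFT JOIN contactmain   cm ON gm.link_id = cm.id
        WHERE gm.group_id = ?
        ORDER BY gm.id
    """, (group_id,)).fetchall()


print(member_labels(contacts_db, 1))
# [('contact', 'Peter'), ('contact', 'Andrew')]
print(member_labels(contacts_db, 3))
# [('group', 'Trustee')]
```

With the sample rows as given, group 3 ('Comitee') only contains the nested group 'Trustee'; expanding that nested group into its own members would need a further step that this query does not attempt.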
If contactgroups.id must equal groupmembers.id but must also equal 2, that's redundant and also probably where your problem is. It works fine as you've written it: http://ideone.com/7EGLZ so without knowing what it's actually supposed to do I can't help more.
EDIT: I'm unfamiliar with the comma-separated FROM, but it gives the same result since you don't select anything from the other table so it doesn't really matter.
I have a MySQL database table with this structure:
table
id INT NOT NULL PRIMARY KEY
data ..
next_id INT NULL
I need to fetch the data in order of the linked list. For example, given this data:
id | next_id
----+---------
1 | 2
2 | 4
3 | 9
4 | 3
9 | NULL
I need to fetch the rows for id=1, 2, 4, 3, 9, in that order. How can I do this with a database query? (I can do it on the client end. I am curious if this can be done on the database side. Thus, saying it's impossible is okay (given enough proof)).
It would be nice to have a termination point as well (e.g. stop after 10 fetches, or when some condition on the row turns true) but this is not a requirement (can be done on client side). I (hope I) do not need to check for circular references.
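Some brands of database (e.g. Oracle, Microsoft SQL Server) support extra SQL syntax to run "recursive queries". MySQL did not support any such solution before version 8.0, which added recursive common table expressions (WITH RECURSIVE); the workarounds here are for older versions.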
The problem you are describing is the same as representing a tree structure in a SQL database. You just have a long, skinny tree.
There are several solutions for storing and fetching this kind of data structure from an RDBMS. See some of the following questions:
"What is the most efficient/elegant way to parse a flat table into a tree?"
"Is it possible to make a recursive SQL query ?"
Since you mention that you'd like to limit the "depth" returned by the query, you can achieve this while querying the list this way:
SELECT * FROM mytable t1
LEFT JOIN mytable t2 ON (t1.next_id = t2.id)
LEFT JOIN mytable t3 ON (t2.next_id = t3.id)
LEFT JOIN mytable t4 ON (t3.next_id = t4.id)
LEFT JOIN mytable t5 ON (t4.next_id = t5.id)
LEFT JOIN mytable t6 ON (t5.next_id = t6.id)
LEFT JOIN mytable t7 ON (t6.next_id = t7.id)
LEFT JOIN mytable t8 ON (t7.next_id = t8.id)
LEFT JOIN mytable t9 ON (t8.next_id = t9.id)
LEFT JOIN mytable t10 ON (t9.next_id = t10.id);
It'll perform like molasses, and the result will come back all on one row (per linked list), but you'll get the result.
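On engines that do support recursive common table expressions (MySQL 8.0 and later, and SQLite as used here), the list can be walked in order with an optional cap on the number of rows. A minimal sketch through Python's sqlite3 module, where the pos column, the chain CTE name and the walk_list function are all made up for this example:

```python
import sqlite3

list_db = sqlite3.connect(":memory:")
list_db.executescript("""
CREATE TABLE mytable (id INTEGER NOT NULL PRIMARY KEY, next_id INTEGER NULL);
INSERT INTO mytable VALUES (1, 2), (2, 4), (3, 9), (4, 3), (9, NULL);
""")


def walk_list(db, start_id, max_rows=10):
    """Follow next_id links from start_id, returning at most max_rows ids in list order."""
    rows = db.execute("""
        WITH RECURSIVE chain(id, next_id, pos) AS (
            SELECT id, next_id, 1
            FROM mytable
            WHERE id = ?
            UNION ALL
            SELECT t.id, t.next_id, c.pos + 1
            FROM mytable t
            JOIN chain c ON t.id = c.next_id
            WHERE c.pos < ?          -- termination point, also guards against cycles
        )
        SELECT id FROM chain ORDER BY pos
    """, (start_id, max_rows)).fetchall()
    return [row[0] for row in rows]


print(walk_list(list_db, 1))
# [1, 2, 4, 3, 9]
print(walk_list(list_db, 1, max_rows=3))
# [1, 2, 4]
```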
If what you are trying to avoid is having several queries (one for each node) and you are able to add columns, then you could have a new column that links to the root node. That way you can pull in all the data at once by the root id, but you will still have to sort the list (or tree) on the client side.
So in this example you would have:
id | next_id | root_id
----+---------+---------
1 | 2 | 1
2 | 4 | 1
3 | 9 | 1
4 | 3 | 1
9 | NULL | 1
Of course, the disadvantage of this compared with traditional linked lists or trees is that changing the root costs O(n) writes, where n is the number of nodes, because you have to update the root id for each node. Fortunately, you should always be able to do this in a single UPDATE query, unless you are dividing a list/tree in the middle.
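The client-side half of the root_id approach could look roughly like this: one query pulls every node of the list, and the ordering is rebuilt in application code by following next_id. This is a sketch using Python's sqlite3 module as a stand-in for the MySQL client; the function name ordered_ids is made up, and the seen set is only a guard against accidental cycles.

```python
import sqlite3

rooted_db = sqlite3.connect(":memory:")
rooted_db.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, next_id INTEGER, root_id INTEGER);
INSERT INTO mytable VALUES
    (1, 2, 1), (2, 4, 1), (3, 9, 1), (4, 3, 1), (9, NULL, 1);
""")


def ordered_ids(db, root_id):
    """Fetch the whole list in one query, then sort it by following next_id."""
    rows = db.execute(
        "SELECT id, next_id FROM mytable WHERE root_id = ?", (root_id,)
    ).fetchall()
    next_of = dict(rows)  # id -> next_id
    ordered, seen, current = [], set(), root_id
    # Stop at the end of the list (next_id is NULL) or if a cycle is detected.
    while current in next_of and current not in seen:
        ordered.append(current)
        seen.add(current)
        current = next_of[current]
    return ordered


print(ordered_ids(rooted_db, 1))
# [1, 2, 4, 3, 9]
```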
This is less a solution and more of a workaround but, for a linear list (rather than the tree Bill Karwin mentioned), it might be more efficient to use a sort column on your list. For example:
CREATE TABLE `schema`.`my_table` (
`id` INT NOT NULL PRIMARY KEY,
`order` INT,
data ..,
INDEX `ix_order` (`order` ASC)
);
Then:
SELECT * FROM `schema`.`my_table` ORDER BY `order`;
This has the disadvantage of slower inserts (you have to reposition all sorted elements past the insertion point) but should be fast for retrieval because the order column is indexed.