I want to summarize rows from one end of a relationship tree with a table on the other side. Is "correlate" the correct term? Really just knowing the terms would help me solve this problem.
I am using MySQL and am extending an existing DB structure - though would have the liberty to rearrange data if needed. I'm getting better at creating "filtering" queries using JOINs, and I'm sure this next piece will be straight-forward once I understand it (without performing tons of queries : )
I made a simplified schema (and theme!) for this example, but the idea is the same.
Say there are many DietPlans, each related to a bunch of MenuItems, and each MenuItem has an ItemType (such as 'Healthy', 'Fast', 'Normal', etc.). On the other side of DietPlan there are Persons, who each store how many DailyCalories they consume, and another table, MenuAllocations, where a Person stores what percentage of their daily intake comes from which MenuItem.
As examples of scale, there could be 1000 MenuItems, with 50 of those associated with each of 200 DietPlans. Also, each DietPlan might have 10,000 Persons, who will each have 5-10 MenuAllocations of various types.
What I'd like to do feels complex to me. I want to create a dashboard for each DietPlan (there could be many), gathering data from the Persons of that DietPlan, and tabulating the number of calories for each item type.
The math is simple: tblPerson.dailyCalories * tblMenuAllocations.percent. But I want to do that for each Person in the DietPlan, for each ItemType.
I understand the JOINs required to 'filter' from tblItemType around to tblMenuAllocation and think it would be similar to this:
SELECT *
FROM tblMenuAllocation
INNER JOIN tblPerson
on tblMenuAllocation.personId = tblPerson.personId
INNER JOIN tblDietPlan
on tblPerson.dietPlanId = tblDietPlan.DietPlanId
INNER JOIN tblMenuItem
on tblMenuItem.dietPlanId = tblDietPlan.DietPlanId
INNER JOIN tblItemType
on tblMenuItem.itemTypeId = tblItemType.itemTypeId
WHERE tblItemType.itemTypeId = 2
It feels like one query for each tblItemType, which could be a LOT of Person and MenuAllocation data to sort through, and doing that many consecutive queries feels like I'm missing something. Also, I think math can be handled in the query to sum values, but I've never done that. Where can I begin?
EDIT: The final results would be something like this:
----------------------------------------------
ItemId | ItemDesc | TotalCalories
----------------------------------------------
1      | Healthy  | 450,876
2      | Fast     | 1,987,948
3      | Vegan    | 349,123
etc.
I would be willing to accept some manipulation of data outside the query, but the Person's specific dailyCalories is very important to the tblMenuAllocation.percent calculation. Some tblMenuAllocation rows might be of the same ItemType!
I think you are looking for these topics: Aggregate Functions and Group By Modifiers.
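To make that concrete, here is a minimal sketch of a single aggregate query that does the math for every item type at once. It runs against an in-memory SQLite database purely so it is self-contained; the SELECT itself should carry over to MySQL largely unchanged. The column names (menuItemId, itemDesc), the assumption that tblMenuAllocation points at a menu item, the assumption that percent is stored as 0-100, and all of the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblItemType (itemTypeId INTEGER PRIMARY KEY, itemDesc TEXT);
CREATE TABLE tblMenuItem (menuItemId INTEGER PRIMARY KEY, itemTypeId INTEGER);
CREATE TABLE tblPerson (personId INTEGER PRIMARY KEY, dietPlanId INTEGER, dailyCalories INTEGER);
CREATE TABLE tblMenuAllocation (personId INTEGER, menuItemId INTEGER, percent REAL);

-- Hypothetical sample data: two item types, three menu items, three persons.
INSERT INTO tblItemType VALUES (1, 'Healthy'), (2, 'Fast');
INSERT INTO tblMenuItem VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO tblPerson VALUES (1, 7, 2000), (2, 7, 3000), (3, 8, 1000);
INSERT INTO tblMenuAllocation VALUES
    (1, 10, 50), (1, 11, 25), (1, 12, 25),
    (2, 10, 40), (2, 12, 60),
    (3, 12, 100);
""")

# One query: multiply per allocation row, then SUM per item type.
# Allocations that share an item type simply land in the same group.
query = """
SELECT t.itemTypeId, t.itemDesc,
       SUM(p.dailyCalories * a.percent / 100.0) AS totalCalories
FROM tblMenuAllocation a
JOIN tblPerson   p ON p.personId   = a.personId
JOIN tblMenuItem m ON m.menuItemId = a.menuItemId
JOIN tblItemType t ON t.itemTypeId = m.itemTypeId
WHERE p.dietPlanId = ?
GROUP BY t.itemTypeId, t.itemDesc
ORDER BY t.itemTypeId
"""
totals = conn.execute(query, (7,)).fetchall()
for row in totals:
    print(row)
```

The point of GROUP BY here is that the per-person multiplication happens row by row before the sum, so each Person's own dailyCalories is respected, and no separate query per item type is needed.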
Something really bugs me and I'm not sure what the "correct" approach is.
If I make a select to get contacts from my database, there are a decent number of joins involved.
It will look something like this (around 60-70 columns):
SELECT *
FROM contacts
LEFT JOIN company
LEFT JOIN person
LEFT JOIN address
LEFT JOIN person_communication
LEFT JOIN company_communication
LEFT JOIN categories
LEFT JOIN notes
company and person have 1:1 cardinality, so that part is straightforward.
But "address", "communication" and "categories" have 1:n cardinality.
So depending on the number of rows in the 1:n tables I get a lot of "double" rows (I don't know the real term for that; I know the rows aren't true duplicates, since the address or phone number etc. is different). For my own contact, which is fairly filled in, I get 85 rows back.
How do you guys work with that?
In my PHP application I always wrote a "data mapper" where the array key was contact.ID (the primary key); I checked whether the key already existed and then pushed the additional data into it. PHP is also not really type-strict, which makes this easy.
Now I'm learning Go (golang) and I thought: screw that LOOOONG select and the data mapping, just write separate selects for all the 1:n tables... yeah no, there are not enough connections to load a table full of contacts. I know that I can increase the number of connections, but the error seems to imply that this would be the wrong way.
I use the following driver: https://github.com/go-sql-driver/mysql
I also tried GROUP_CONCAT, but then I ran into trouble parsing the result back.
Do I have to do my mapping approach again, or is there some nice solution out there? I found it quite dirty at points, though.
The solution is simple: you need to execute more than one query!
The cause of all the "duplicate" rows is that you're generating a result called a Cartesian product. You are trying to join to several tables with 1:n relationships, but each of these has no relationship to the other, so there's no join condition restricting them with respect to each other.
Therefore you get a result with every combination of all the 1:n relationships. If you have 3 matches in address, 5 matches in communication, and 5 matches in categories, you'd get 3x5x5 = 75 rows.
So you need to run a separate SQL query for each of your 1:n relationships. Don't be afraid—MySQL can handle a few queries. You need them.
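As a rough illustration of the difference, the sketch below builds one contact with 3 addresses, 5 communication rows and 5 categories, then compares the single big join against one query per 1:n table. It uses an in-memory SQLite database so that it runs anywhere; the table layout (a contact_id column on each child table) and the data are assumptions made up for the example, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE address (contact_id INTEGER, city TEXT);
CREATE TABLE communication (contact_id INTEGER, value TEXT);
CREATE TABLE categories (contact_id INTEGER, label TEXT);
INSERT INTO contacts VALUES (1, 'Alice');
""")
conn.executemany("INSERT INTO address VALUES (1, ?)", [(f"city{i}",) for i in range(3)])
conn.executemany("INSERT INTO communication VALUES (1, ?)", [(f"phone{i}",) for i in range(5)])
conn.executemany("INSERT INTO categories VALUES (1, ?)", [(f"cat{i}",) for i in range(5)])

# The single big join: every address is paired with every communication row
# and every category, so one contact comes back as 3 * 5 * 5 rows.
joined_rows = conn.execute("""
    SELECT * FROM contacts c
    LEFT JOIN address a       ON a.contact_id = c.id
    LEFT JOIN communication m ON m.contact_id = c.id
    LEFT JOIN categories g    ON g.contact_id = c.id
""").fetchall()
print(len(joined_rows))

# One query for the contacts, then one query per 1:n table (not per contact).
contacts = {
    cid: {"name": name, "address": [], "communication": [], "categories": []}
    for cid, name in conn.execute("SELECT id, name FROM contacts")
}
# Table and column names below are fixed in the code, never user input.
for table, column in (("address", "city"), ("communication", "value"), ("categories", "label")):
    for cid, value in conn.execute(f"SELECT contact_id, {column} FROM {table}"):
        contacts[cid][table].append(value)
print(contacts[1])
```

The number of queries here depends on the number of 1:n tables, not on the number of contacts, and the mapping step stays a simple append keyed by the contact id.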
I have a table Things and I want to add ownership relations to a table Users. I need to be able to quickly query the owners of a thing and the things a user owns. If I know that there will be at most 50 owners, and the pdf for the number of owners will probably look like this, should I rather
add 50 columns to the Things table, like CoOwner1Id, CoOwner2Id, …, CoOwner50Id, or
should I model this with an Ownerships table which has UserId and ThingId columns, or
would it be better to create a table for each thing, for example Thing8321Owners with a row for each owner, or
perhaps a combination of these?
The second choice is the correct one: you should create an intermediate table between the table Things and the table Owners (which contains the details of each owner).
This table should have thing_id and owner_id together as its primary key.
So finally, you will have 3 tables:
Things (the things' details and data)
Owners (the owners' details and data)
Ownerships (the assignment of each thing_id to an owner_id)
This matters because in a relational DB you should not have any redundant data.
You should definitely go with option 2, because what you are trying to model is a many-to-many relationship. (Many owners can relate to a thing. Many things can relate to an owner.) This is commonly accomplished using what I call a bridging table (which is exactly what option 2 is). It is a standard technique in a normalized database.
The other two options are going to give you nightmares trying to query or maintain.
With option 1 you'll need to join the User table to the Thing table on 50 columns to get all of your results. And what happens when you have a really popular thing that 51 people want to own?
Option 3 is even worse. The only way to easily query the data is to use dynamic sql or write a new query each time because you don't know which Thing*Owners table to join on until you know the ID value of the thing you're looking for. Or you're going to need to join the User table to every single Thing*Owners table. Adding a new thing means creating a whole new table. But at least a thing doesn't have a limit on the number of owners it could possibly have.
Now isn't this:
SELECT Users.Name, Things.Name
FROM Users
INNER JOIN Ownerships ON Users.UserId=Ownerships.UserId
INNER JOIN Things ON Things.ThingId=Ownerships.ThingId
much easier than any of those other scenarios?
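For what it's worth, here is a small runnable sketch of the bridging table, using an in-memory SQLite database as a stand-in. The composite primary key, the extra index for lookups in the other direction, and the sample users and things are my own assumptions for illustration; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users  (UserId  INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Things (ThingId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Ownerships (
    UserId  INTEGER NOT NULL REFERENCES Users(UserId),
    ThingId INTEGER NOT NULL REFERENCES Things(ThingId),
    PRIMARY KEY (ThingId, UserId)          -- serves "owners of a thing"
);
-- Second index so "things a user owns" is just as quick.
CREATE INDEX idx_ownerships_user ON Ownerships (UserId, ThingId);

-- Hypothetical sample data.
INSERT INTO Users  VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Things VALUES (100, 'Boat'), (101, 'Car');
INSERT INTO Ownerships (UserId, ThingId) VALUES (1, 100), (2, 100), (1, 101);
""")

# Owners of one thing.
owners_of_boat = [name for (name,) in conn.execute("""
    SELECT u.Name FROM Ownerships o
    JOIN Users u ON u.UserId = o.UserId
    WHERE o.ThingId = ? ORDER BY u.Name""", (100,))]

# Things owned by one user.
things_of_ann = [name for (name,) in conn.execute("""
    SELECT t.Name FROM Ownerships o
    JOIN Things t ON t.ThingId = o.ThingId
    WHERE o.UserId = ? ORDER BY t.Name""", (1,))]

# The composite primary key rejects recording the same ownership twice.
try:
    conn.execute("INSERT INTO Ownerships (UserId, ThingId) VALUES (1, 100)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(owners_of_boat, things_of_ann, duplicate_rejected)
```

A 51st owner is just one more row, and neither query changes shape as the number of owners grows.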
First, the conditions. I have people belonging to "small" groups (in other words, everyone has a small_group_id). "Small" groups then form "big" groups (in other words, a small group may or may not have a big_group_id, depending on whether it belongs to a bigger group or not).
I want to create a table structure (that would be used by PHP) for storing and displaying the following two things:
Public messages (meaning anyone, registered or not, will be able to see them). Only the author of a message can edit/delete it. This is the easy part :)
Private messages WITH a definition of how private each one is. That means a private message should have properties for a) which small groups can see it and b) which big groups can see it (which assumes that all members of those big groups have the right to see it).
Basically, the challenge for me is how to design and later work with the visibility of those private messages.
My first thought was a table like: msgID, msgBody, small_groups_list, big_group_list, authorID. So I would store in small_groups_list something like 'id_1; id_4; id_10', and similarly for big groups. But then I'm not sure how to search through such stored lists when, e.g., a person belonging to small_group_id = 10 is supposed to see that message. Also, what should the definitions/types of the columns small_groups_list and big_group_list be?
Perhaps there is a better way to store such things and to use them as well?
That is why I'm here. What would be better practices for such requirements?
(It is going to be implemented on MySQL.)
Thank you in advance.
[edit]
I'm pretty inexperienced in SQL and DB things. Please take that into account when answering.
First: Don't denormalize your data with "array" columns. That makes it a horror to query, and even worse to update.
Instead, you need two separate tables: small_group_visibility and big_group_visibility. Each of these two tables will consist of msgID and groupID. Basically, each one is a many-to-many relationship that points to both the group and the message it concerns.
This is a pretty common database pattern.
To query for messages to be displayed, imagine that we have a user whose small groups are (1, 2, 3) and whose large groups are (10, 20).
SELECT DISTINCT m.msgID, m.msgSubject, m.msgBody -- and so on
FROM messages m
LEFT JOIN small_group_visibility sg
ON sg.msgID = m.msgID
LEFT JOIN big_group_visibility bg
ON bg.msgID = m.msgID
WHERE
sg.groupID IN (1, 2, 3) OR
bg.groupID IN (10, 20);
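If it helps to see the pattern run end to end, below is a minimal sketch against an in-memory SQLite database (a stand-in for MySQL). The helper visible_message_ids, the composite primary keys and the four sample messages are hypothetical and only there for illustration. The sketch assumes both group lists are non-empty, since MySQL does not accept an empty IN () list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (msgID INTEGER PRIMARY KEY, msgBody TEXT, authorID INTEGER);
CREATE TABLE small_group_visibility (msgID INTEGER, groupID INTEGER, PRIMARY KEY (msgID, groupID));
CREATE TABLE big_group_visibility   (msgID INTEGER, groupID INTEGER, PRIMARY KEY (msgID, groupID));

-- Hypothetical sample data: one row per (message, group) pair, no lists in columns.
INSERT INTO messages VALUES (1, 'for small group 2', 5), (2, 'for big group 20', 5),
                            (3, 'for small group 9', 6), (4, 'for small 1 and big 10', 6);
INSERT INTO small_group_visibility VALUES (1, 2), (3, 9), (4, 1);
INSERT INTO big_group_visibility   VALUES (2, 20), (4, 10);
""")

def visible_message_ids(conn, small_groups, big_groups):
    """Return ids of messages visible to a user in the given (non-empty) group lists."""
    small_marks = ", ".join("?" * len(small_groups))  # one placeholder per group id
    big_marks = ", ".join("?" * len(big_groups))
    sql = f"""
        SELECT DISTINCT m.msgID
        FROM messages m
        LEFT JOIN small_group_visibility sg ON sg.msgID = m.msgID
        LEFT JOIN big_group_visibility   bg ON bg.msgID = m.msgID
        WHERE sg.groupID IN ({small_marks}) OR bg.groupID IN ({big_marks})
        ORDER BY m.msgID
    """
    return [row[0] for row in conn.execute(sql, [*small_groups, *big_groups])]

visible = visible_message_ids(conn, [1, 2, 3], [10, 20])
print(visible)
```

Message 3 is only shared with small group 9, so it is filtered out for this user; message 4 matches through both tables but DISTINCT keeps it to a single row.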
On the project I'm working on we have an activity table and each activity can be linked to one of about 20 different "activity details" tables...
e.g. If the activity was of type "work", then it would have a corresponding activity_details_work record, if it was of type "sick leave" then it would have a corresponding activity_details_sickleave record and so on.
Currently we are loading the activities and then for each activity we have a separate query to go fetch the activity details from the relevant table. This obviously doesn't scale well if you have thousands of activities.
So my initial thought was to have a single query which fetches the activities and joins the details in one go e.g.
SELECT * FROM activity
LEFT JOIN activity_details_1_work ON ...
LEFT JOIN activity_details_2_sickleave ON ...
LEFT JOIN activity_details_3_travelwork ON ...
...etc...
LEFT JOIN activity_details_20_yearleave ON ...
But this will result in each record having 100's of fields, most of which are empty and that feels nasty.
Lazy-loading the details isn't really an option either as the details are almost always requested in the core logic, at least for the main types anyway.
Is there a super clever way of doing this that I'm not thinking of?
Thanks in advance
My suggestion is to define a view for each ActivityType, that is tailored specifically to that activity.
Then add an index on the Activity table led by the ActivityType field. Cluster said index unless there is an overwhelming need for some other to be clustered (or performance benchmarking shows some other clustering selection to be more performant).
Is there a particular reason why this degree of denormalization was designed in? Is that reason well known?
Chances are your activity tables are like (date_from, date_to, with_who, descr) or something to that effect. As Pieter suggested, consider tossing in a type varchar or enum field in there, so as to deal with a single details table.
If there are rational reasons to keep the tables apart, consider adding triggers that maintain boolean/tinyint fields (has_work, has_sickleave, etc), or a bit string (has_activites_of_type where the first position amounts to has_work, the next to has_sickleave, etc.).
Either way, you'll probably be better off by fetching the activity's details in one or more separate queries -- if only to avoid field name collisions.
I don't think enum is the way to go, because as you say there might be 1000's of activities, then altering your activity table would become an issue.
There is no point doing a left join on a large number of tables either.
So the options that you have are:
See this. The first comment might be useful.
I am guessing that your activity table has a field called activity_type_id.
Build a table called activity_types containing the fields activity_type_id, activity_name, and activity_details_table_name. First, query in the following way:
SELECT activity.*, activity_types.activity_details_table_name
FROM activity
INNER JOIN activity_types
USING (activity_type_id)
This query gives you the table name on which to query for the details.
This way you can add any new activity type just by adding a row in the activity_types table.
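A rough sketch of how that lookup could be used to fetch details with one query per details table rather than one per activity follows. It runs on an in-memory SQLite database as a stand-in; the activity_id column, the two details tables and their columns, and the sample rows are all assumptions for the example. Note that the table name is spliced into the SQL, which is only reasonable because it comes from your own activity_types table and never from user input.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity_types (
    activity_type_id INTEGER PRIMARY KEY,
    activity_name TEXT,
    activity_details_table_name TEXT
);
CREATE TABLE activity (activity_id INTEGER PRIMARY KEY, activity_type_id INTEGER);
CREATE TABLE activity_details_work      (activity_id INTEGER PRIMARY KEY, project TEXT);
CREATE TABLE activity_details_sickleave (activity_id INTEGER PRIMARY KEY, reason TEXT);

-- Hypothetical sample data.
INSERT INTO activity_types VALUES (1, 'work', 'activity_details_work'),
                                  (2, 'sick leave', 'activity_details_sickleave');
INSERT INTO activity VALUES (1, 1), (2, 2), (3, 1);
INSERT INTO activity_details_work VALUES (1, 'alpha'), (3, 'beta');
INSERT INTO activity_details_sickleave VALUES (2, 'flu');
""")

# Step 1: one query tells us which details table each activity lives in.
ids_by_table = defaultdict(list)
for activity_id, table_name in conn.execute("""
        SELECT a.activity_id, t.activity_details_table_name
        FROM activity a
        INNER JOIN activity_types t USING (activity_type_id)
        ORDER BY a.activity_id"""):
    ids_by_table[table_name].append(activity_id)

# Step 2: one query per details table, covering all of its activities at once.
details = {}
for table_name, ids in ids_by_table.items():
    marks = ", ".join("?" * len(ids))
    cur = conn.execute(
        f"SELECT * FROM {table_name} WHERE activity_id IN ({marks})", ids)
    columns = [d[0] for d in cur.description]
    for row in cur:
        details[row[0]] = dict(zip(columns, row))  # keyed by activity_id (first column)

print(details)
```

With 20 activity types this is at most 21 queries regardless of how many thousands of activities are loaded, and each result only carries the columns of its own details table.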
I have a database with, among others, the following two tables:
classes is a straightforward table that has one row per class in a class schedule.
sessions is a table that characterizes the days and times that each class meets, where each row is capable of expressing a notion like:
"Tuesdays | Jan 22-Mar 5 | 6-9pm"
"Tuesdays & Thursdays | Jan 22-Mar 7 | 6-9pm"
"Monday-Thursday | Jan 21-24 | 3-6pm"
"Saturday | Mar 9 | 9am-4pm"
and so on.
There is guaranteed to be at least one row in sessions for each row in classes, and for certain classes there may be two or more associated session rows.
At present, I'm using two different queries to get the class and session information for the classes that match a particular set of criteria, like this:
select c.class_id, c.title, c.instructor, c.num_seats, c.price
from classes c
join classes_by_department cbd
on (cbd.class_id = c.class_id)
join /* several other tables */
on /* several other join conditions */
where cbd.department_id = '{$dept_id}'
and /* several other qualifying conditions */
;
and this:
select s.class_id, s.start_date, s.end_date, s.day_bits, s.start_time, s.end_time
from sessions s
join classes c
on (c.class_id = s.class_id)
join classes_by_department cbd
on (cbd.class_id = s.class_id)
join /* the same other tables */
on /* the same other join conditions */
where cbd.department_id = '{$dept_id}'
and /* the same other qualifying conditions */
;
This works fine, and -- at least in the current application -- the tables aren't big enough, and the traffic isn't heavy enough, for two queries to be a problem. Nevertheless, it strikes me as a bit wasteful, and I'm wondering if there isn't a way to better leverage the work already done by the first query to perform the second one (rather than what amounts to running the same query twice and just selecting different columns).
Of course I realize that I could just select all the relevant columns from classes and sessions in a single query (the second one), but I like the fact that in the current approach, the first query delivers exactly one row per qualifying class, rather than as many rows as the class has session records. I would need to restructure the existing logic that processes the query results if I merged the queries. (Yeah, I know, waah...)
One solution that occurred to me is to collect all the class_ids returned by the first query into a vector (since I have to iterate through those results anyway) and then format the contents of that vector as the content of a value-list for an IN clause, so that the second query would simply become:
select s.class_id, s.start_date, s.end_date, s.day_bits, s.start_time, s.end_time
from sessions s
where s.class_id in (/* value-list */);
I'm not too worried about the scalability of such a solution, as I understand that huge SQL queries are no big deal. Plus, it could take advantage of an index defined over sessions.class_id.
But... well... it's just not very satisfying to someone who's looking to improve his SQL chops, which I'll freely admit are pretty rudimentary. It feels inelegant, and not very "SQL-ish," or whatever the SQL equivalent to the term Pythonic is.
Can anyone suggest something more appropriate?
The canonical way to do what you want is to use views. Define your first query as:
create view vw_MyClasses as
select c.class_id, c.title, c.instructor, c.num_seats, c.price, cbd.department_id
from classes c
join classes_by_department cbd
on (cbd.class_id = c.class_id)
join /* several other tables */
on /* several other join conditions */
where /* several other qualifying conditions */
Then your class query would be:
select *
from vw_MyClasses
where department_id = '{$dept_id}'
Then, your second query can be:
select s.class_id, s.start_date, s.end_date, s.day_bits, s.start_time, s.end_time
from sessions s
where s.class_id in (select class_id from vw_MyClasses
where department_id = '{$dept_id}');
Or, what may be more efficient in MySQL:
select s.class_id, s.start_date, s.end_date, s.day_bits, s.start_time, s.end_time
from sessions s
where exists (select 1 from vw_MyClasses mc
where mc.class_id = s.class_id and mc.department_id = '{$dept_id}');
There is a very good reason for doing this. Repeating such logic in multiple queries becomes a maintenance nightmare. As you modify the logic in one place, it is very easy to forget to make the modifications in all places. Sometimes, views are not sufficient, so you may need to use user defined functions, as explained here.
Also, if the criteria are so useful, you might want to put flags in the class table to identify them. This requires maintaining them in some way, such as nightly updates or using triggers.
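Here is a cut-down, runnable sketch of the view approach, using an in-memory SQLite database as a stand-in for MySQL. The schema is trimmed to a few columns, the "several other tables" are left out, and the sample classes and sessions are invented for the example. It also passes the department id as a bound parameter instead of interpolating '{$dept_id}' into the SQL string, which is my own substitution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classes (class_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE classes_by_department (class_id INTEGER, department_id INTEGER);
CREATE TABLE sessions (class_id INTEGER, start_date TEXT, end_date TEXT);

-- Hypothetical sample data.
INSERT INTO classes VALUES (1, 'Pottery'), (2, 'Welding'), (3, 'Drawing');
INSERT INTO classes_by_department VALUES (1, 10), (2, 20), (3, 10);
INSERT INTO sessions VALUES (1, 'Jan 22', 'Mar 5'), (1, 'Mar 9', 'Mar 9'),
                            (2, 'Jan 21', 'Jan 24'), (3, 'Jan 22', 'Mar 7');

-- The shared qualifying logic lives in exactly one place.
CREATE VIEW vw_MyClasses AS
SELECT c.class_id, c.title, cbd.department_id
FROM classes c
JOIN classes_by_department cbd ON cbd.class_id = c.class_id;
""")

dept_id = 10

# First query: exactly one row per qualifying class.
class_rows = conn.execute(
    "SELECT class_id, title FROM vw_MyClasses WHERE department_id = ? ORDER BY class_id",
    (dept_id,)).fetchall()

# Second query: sessions of those classes, reusing the view instead of repeating the joins.
session_rows = conn.execute("""
    SELECT s.class_id, s.start_date, s.end_date
    FROM sessions s
    WHERE EXISTS (SELECT 1 FROM vw_MyClasses mc
                  WHERE mc.class_id = s.class_id AND mc.department_id = ?)
    ORDER BY s.class_id, s.start_date
""", (dept_id,)).fetchall()

print(class_rows)
print(session_rows)
```

Both queries keep their current result shapes, so the code that processes one row per class would not need restructuring.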
In all honesty I wouldn't bother. Firstly, it works just fine and seems fairly elegant to me from what you've told us. Secondly, if there's no reason to bring back extra data in the second query, then don't do it. Thirdly, and by far the most important, as it currently stands it is fairly easy to understand what's happening. You may not always be the only person trying to decipher this, and it is important that the code is readable by someone else. Overcomplicated SQL queries are not nice.
I think it is just fine as is and its SQL-ishness is good.