I have inherited a table with information about some groups of people, in which one field contains delimited data whose values match IDs in another table.
id_group Name
-----------------------
1 2|4|5
2 3|4|6
3 1|2
And in another table I have a list of people who may belong to one or more groups
id_names Names
-----------------------
1 Jack
2 Joe
3 Fred
4 Mary
5 Bill
I would like to perform a select on the group data which results in a single field containing a comma- or space-delimited list of names, such as this for the first group row above: "Joe Mary Bill"
I have looked at using a function to split the delimited string, and also looked at sub queries, but concatenating the results of sub queries quickly becomes huge.
Thanks!
As implied by Strawberry's comment above, there is a way to do this, but it's so ugly. It's like finishing your expensive kitchen remodel using duct tape. You should feel resentment toward the person who designed the database this way.
SELECT g.id_group, GROUP_CONCAT(n.Names SEPARATOR ' ') AS Names
FROM groups AS g JOIN names AS n
ON FIND_IN_SET(n.id_names, REPLACE(g.Name, '|', ','))
GROUP BY g.id_group;
Output, tested on MySQL 5.6:
+----------+---------------+
| id_group | Names         |
+----------+---------------+
|        1 | Joe Mary Bill |
|        2 | Fred Mary     |
|        3 | Jack Joe      |
+----------+---------------+
The complexity of this query, and the fact that it will be forced to do a table-scan and cannot be optimized, should convince you of what is wrong with storing a list of IDs in a delimited string.
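To make the matching idea concrete, here is a minimal sketch in Python with SQLite standing in for MySQL (that swap is my assumption, chosen only so the example runs anywhere). SQLite has no FIND_IN_SET, so the match is emulated with LIKE on a delimiter-wrapped string; the table name member_groups is made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member_groups (id_group INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE names (id_names INTEGER PRIMARY KEY, Names TEXT);
    INSERT INTO member_groups VALUES (1, '2|4|5'), (2, '3|4|6'), (3, '1|2');
    INSERT INTO names VALUES (1, 'Jack'), (2, 'Joe'), (3, 'Fred'),
                             (4, 'Mary'), (5, 'Bill');
""")

# Wrapping both sides in '|' keeps id 2 from matching inside e.g. '12'.
# The pattern starts with a wildcard, so every group row is compared
# against every name row; no index can help here.
rows = conn.execute("""
    SELECT g.id_group, n.Names
    FROM member_groups AS g
    JOIN names AS n
      ON ('|' || g.Name || '|') LIKE ('%|' || n.id_names || '|%')
    ORDER BY g.id_group, n.id_names
""").fetchall()

# Concatenate the names per group on the client side.
names_by_group = {}
for id_group, name in rows:
    names_by_group.setdefault(id_group, []).append(name)
result = {gid: " ".join(names) for gid, names in names_by_group.items()}
print(result)
```

The result matches the MySQL output above, but the join condition is exactly the kind of expression that forces a scan.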
The better solution is to create a third table, in which you store each individual member of the group on a row by itself. That is, multiple rows per group.
CREATE TABLE group_name (
id_group INT NOT NULL,
id_name INT NOT NULL,
PRIMARY KEY (id_group, id_name)
);
Then you can query in a simpler way, and you have an opportunity to create indexes to make the query very fast.
SELECT g.id_group, GROUP_CONCAT(n.Names SEPARATOR ' ') AS names
FROM groups AS g
JOIN group_name AS gn ON gn.id_group = g.id_group
JOIN names AS n ON n.id_names = gn.id_name
GROUP BY g.id_group;
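If you are stuck with existing delimited data, a one-off migration into the new table could look roughly like the sketch below. It is written in Python with SQLite standing in for MySQL (my assumption), and legacy_groups is a hypothetical name for the inherited table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_groups (id_group INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE names (id_names INTEGER PRIMARY KEY, Names TEXT);
    CREATE TABLE group_name (
        id_group INTEGER NOT NULL,
        id_name  INTEGER NOT NULL,
        PRIMARY KEY (id_group, id_name)
    );
    INSERT INTO legacy_groups VALUES (1, '2|4|5'), (2, '3|4|6'), (3, '1|2');
    INSERT INTO names VALUES (1, 'Jack'), (2, 'Joe'), (3, 'Fred'),
                             (4, 'Mary'), (5, 'Bill');
""")


def migrate(conn):
    """Split each delimited string into one (group, member) row."""
    rows = conn.execute("SELECT id_group, Name FROM legacy_groups").fetchall()
    pairs = [(gid, int(part))
             for gid, packed in rows
             for part in packed.split("|") if part]
    # OR IGNORE skips duplicates that would violate the primary key.
    conn.executemany("INSERT OR IGNORE INTO group_name VALUES (?, ?)", pairs)
    return len(pairs)


migrated = migrate(conn)
members = conn.execute("""
    SELECT gn.id_group, n.Names
    FROM group_name AS gn
    JOIN names AS n ON n.id_names = gn.id_name
    ORDER BY gn.id_group, n.id_names
""").fetchall()
print(migrated, members)
```

Note that member 6 of group 2 has no row in names, so it is migrated but drops out of the inner join, just as in the FIND_IN_SET version.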
Shadow is correct. Your primary problem is the bad design of relations in the database. Typically one designs this kind of business problem as a so-called M:N (many-to-many) relation. To accomplish that you need 3 tables:
The first table is groups, which has a GroupId field with a primary key on it and a readable Name field (e.g. 'group1' or whatever).
The second table is people, which looks exactly as you showed above. (Do not forget to put a primary key on the PeopleId field here too.)
The third table is a bridge table called GroupMemberships. It has 2 fields, GroupId and PeopleId. This table connects the first two with each other and models the M:N relation: one group can have 1 to N members, and one person can be a member of 1 to M groups.
Finally, just join together the tables in the select and aggregate:
SELECT
g.Name,
GROUP_CONCAT(p.Name ORDER BY p.PeopleId DESC SEPARATOR ';') AS Members
FROM
Groups AS g
INNER JOIN GroupMemberships AS gm ON g.GroupId = gm.GroupId
INNER JOIN people AS p ON gm.PeopleId = p.PeopleId
GROUP BY g.Name;
I'm trying to come up with the most performant database schema for a specific data structure. There are two main entities: Courses and Themes. A Course is a collection of Themes. A Theme has fields like Videos, Resources and Video Total Time.
Visually representing this data structure:
- Course
|_ ID: 12345
|_ Themes: [A, B] (an array of UIDs)
- Theme A
|_ Courses: [12345,67890] (an array of UIDs)
|_ Videos: [1,2,3,4,5,7] (an array of UIDs)
|_ Resources: [10,11,12] (an array of UIDs)
|_ Video Total Time: 10000 (probably stored as seconds in an integer field)
- Theme B
|_ Courses: [12345,98765] (an array of UIDs)
|_ Videos: [5,6,7,8] (an array of UIDs)
|_ Resources: [12,13,14] (an array of UIDs)
|_ Video Total Time: 20000 (probably stored as seconds in an integer field)
What I'm trying to achieve is a database schema with two tables, one for Courses and one for Themes. The idea would be to have a MySQL query that gets a Course and groups all fields from its Themes. In other words, when I get the result of the MySQL query (using PHP), I'll get an array or object like this:
Array(
'ID' => 12345
'themes' => [A,B]
'videos' => [1,2,3,4,5,6,7,8]
'resources' => [10,11,12,13,14]
'video_total_time' => 30000
)
So, the point is that these are two related tables. When I send a query to the DB requesting data for a course, I need to pull data from all of its themes and merge them together.
Since I'm not an expert on SQL / MySQL, I'm trying to learn a little bit about it while I try to figure out:
1) What is the best database schema for these two entities, Courses and Themes? Thinking especially about performance.
2) Can I get the final data all using SQL? Or should I pull some data from the database, and then parse the data with PHP? What is usually faster?
3) What is the best way to store the array of UIDs? As a string? Or is there a better way to store it?
The primary goal of this is performance. I have this kind of data in a different database schema, merged with thousands of other kinds of data (WP databases, wp_posts / wp_postmeta tables), but right now it's really slow to get the information I need.
Any tips and suggestions are more than welcome!
Edit: Solved!
It was a tough call to decide which answer best suits my needs, because @TimMorton's and @PaulSpiegel's answers lead down the same path, but with slightly different approaches. Tim's answer is great for understanding how to properly design database schemas, taking into account many-to-many relationships, and how to organize your queries. But since the main focus of this question is to improve performance, Paul's answer is more focused on that, with specific details about primary keys and indexes (which are fundamental to improving query performance).
Anyways, I learned a lot about designing a database schema. Here's the lessons I learned:
Don't try to stuff everything into the same table: it is fundamental to identify the entities properly before defining which tables you need. I started with two tables, for Courses and Themes. But it turns out that a proper DB schema for my specification also includes tables for Videos and Resources.
Don't store arrays in columns: use a proper strategy to define the relation between entities. If you have one-to-one or one-to-many relationships, use the entities' IDs and foreign keys. If you have many-to-many relationships, then the proper design pattern is to create a dedicated table whose only job is to relate the entities. This will allow you to use JOIN clauses in your queries to put all the data together.
When you think about performance, think about INDEX: depending on your table structure, using either an index or a composite index can improve query performance.
Don't try to get everything in one big query: you definitely can, but having separate queries for sections of the data you need (in my example, one query to get all themes for a course, one to get all videos for the course, one to get resources for the course) pays off in code organization and readability.
I don't know if I'm correct with everything above, but it's what I learned so far. Hope this helps someone else too.
Creating the schema
Step 1: Identify entities and their attributes
Course (ID, title, description)
Theme (ID, title, description)
Video (ID, title, description, duration)
Ressource (ID, title, url)
Step 2: Identify relations
Theme => Course
Video => Theme
Ressource => Theme
Step 3: Create tables
courses
    ID (PK)
    title
    description
themes
    ID (PK)
    course_id (FK)
    title
    description
videos
    ID (PK)
    theme_id (FK)
    title
    description
    duration
ressources
    ID (PK)
    theme_id (FK)
    title
    url
If themes can share videos and ressources, then it would be many-to-many relations.
In this case you would need separate tables for those relations.
Remove the theme_id column from videos and ressources and add the following tables:
themes_videos
    theme_id (PK) (FK)
    video_id (PK) (FK)
themes_ressources
    theme_id (PK) (FK)
    ressource_id (PK) (FK)
Here you should define composite primary keys on (theme_id, video_id) and (theme_id, ressource_id).
Also create reverse indexes on (video_id, theme_id) and (ressource_id, theme_id).
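As a rough illustration of the composite key plus reverse index, here is a sketch in Python using SQLite in place of MySQL (my assumption; the DDL idea carries over). The index name idx_videos_themes and the helper link() are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE themes (id INTEGER PRIMARY KEY, course_id INTEGER, title TEXT);
    CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT, duration INTEGER);
    CREATE TABLE themes_videos (
        theme_id INTEGER NOT NULL REFERENCES themes(id),
        video_id INTEGER NOT NULL REFERENCES videos(id),
        PRIMARY KEY (theme_id, video_id)          -- lookups by theme
    );
    -- Reverse index: lookups that start from a video.
    CREATE INDEX idx_videos_themes ON themes_videos (video_id, theme_id);
    INSERT INTO themes VALUES (1, 123, 'Theme A');
    INSERT INTO videos VALUES (5, 'Intro', 600);
""")


def link(theme_id, video_id):
    """Attach a video to a theme; False if the pair already exists."""
    try:
        conn.execute("INSERT INTO themes_videos VALUES (?, ?)",
                     (theme_id, video_id))
        return True
    except sqlite3.IntegrityError:
        return False


first = link(1, 5)
second = link(1, 5)  # same pair again, rejected by the composite key
index_names = [row[1] for row in conn.execute("PRAGMA index_list('themes_videos')")]
print(first, second, index_names)
```

The composite primary key doubles as a uniqueness guarantee, so a video cannot be attached to the same theme twice.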
Retrieving data
Assuming you know the ID of the course (which is 123),
you can then retrieve the related data (from the many-to-many schema)
with the following queries (which you execute one by one):
select c.*
from courses c
where c.id = 123;
select t.*
from themes t
where t.course_id = 123;
select distinct v.*
from themes t
join themes_videos tv on tv.theme_id = t.id
join videos v on v.id = tv.video_id
where t.course_id = 123;
select distinct r.*
from themes t
join themes_ressources tr on tr.theme_id = t.id
join ressources r on r.id = tr.ressource_id
where t.course_id = 123;
Then compose your array/object from retrieved data in PHP.
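The composing step might look something like this. The sketch is in Python rather than PHP and uses SQLite instead of MySQL (both are my assumptions, for a self-contained example); the sample rows mirror the question, except that every video gets a made-up duration of 100 seconds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE themes (id INTEGER PRIMARY KEY, course_id INTEGER, title TEXT);
    CREATE TABLE videos (id INTEGER PRIMARY KEY, duration INTEGER);
    CREATE TABLE ressources (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE themes_videos (theme_id INTEGER, video_id INTEGER,
                                PRIMARY KEY (theme_id, video_id));
    CREATE TABLE themes_ressources (theme_id INTEGER, ressource_id INTEGER,
                                    PRIMARY KEY (theme_id, ressource_id));
    INSERT INTO courses VALUES (12345, 'Sample course');
    INSERT INTO themes VALUES (1, 12345, 'A'), (2, 12345, 'B');
""")
conn.executemany("INSERT INTO videos VALUES (?, 100)", [(i,) for i in range(1, 9)])
conn.executemany("INSERT INTO ressources VALUES (?, '')", [(i,) for i in range(10, 15)])
conn.executemany("INSERT INTO themes_videos VALUES (?, ?)",
                 [(1, v) for v in (1, 2, 3, 4, 5, 7)] + [(2, v) for v in (5, 6, 7, 8)])
conn.executemany("INSERT INTO themes_ressources VALUES (?, ?)",
                 [(1, r) for r in (10, 11, 12)] + [(2, r) for r in (12, 13, 14)])


def load_course(conn, course_id):
    """Run the separate queries and merge the results into one dict."""
    themes = [r[0] for r in conn.execute(
        "SELECT id FROM themes WHERE course_id = ? ORDER BY id", (course_id,))]
    videos = conn.execute("""
        SELECT DISTINCT v.id, v.duration
        FROM themes t
        JOIN themes_videos tv ON tv.theme_id = t.id
        JOIN videos v ON v.id = tv.video_id
        WHERE t.course_id = ? ORDER BY v.id""", (course_id,)).fetchall()
    resources = [r[0] for r in conn.execute("""
        SELECT DISTINCT r.id
        FROM themes t
        JOIN themes_ressources tr ON tr.theme_id = t.id
        JOIN ressources r ON r.id = tr.ressource_id
        WHERE t.course_id = ? ORDER BY r.id""", (course_id,))]
    return {
        "ID": course_id,
        "themes": themes,
        "videos": [v[0] for v in videos],
        "resources": resources,
        # Each shared video is counted once because of DISTINCT.
        "video_total_time": sum(v[1] for v in videos),
    }


course = load_course(conn, 12345)
print(course)
```

Because shared videos are deduplicated, the total time here is the sum over distinct videos, not the sum of per-theme totals.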
Performance
Trying to get all data with a single SQL query is not always a good idea.
You just make your code and schema too complicated.
Executing a couple of queries is not the end of the world.
What you should avoid is executing a query in a loop
(like: for each theme select related videos).
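One way to avoid the loop is to fetch the rows for all themes of the course at once and group them on the client. A small sketch, again in Python with SQLite as a stand-in for MySQL (my assumption) and with made-up sample rows:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE themes (id INTEGER PRIMARY KEY, course_id INTEGER);
    CREATE TABLE themes_videos (theme_id INTEGER, video_id INTEGER,
                                PRIMARY KEY (theme_id, video_id));
    INSERT INTO themes VALUES (1, 123), (2, 123), (3, 999);
    INSERT INTO themes_videos VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 9);
""")


def videos_by_theme(conn, course_id):
    """One query for the whole course, grouped per theme in Python."""
    grouped = defaultdict(list)
    rows = conn.execute("""
        SELECT tv.theme_id, tv.video_id
        FROM themes t
        JOIN themes_videos tv ON tv.theme_id = t.id
        WHERE t.course_id = ?
        ORDER BY tv.theme_id, tv.video_id""", (course_id,))
    for theme_id, video_id in rows:
        grouped[theme_id].append(video_id)
    return dict(grouped)


per_theme = videos_by_theme(conn, 123)
print(per_theme)
```

This issues a single query no matter how many themes the course has, instead of one query per theme.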
In its simplest form, assuming no many-to-many relationships:
Course              Theme
--------            --------
CourseID <--+       ThemeId
Name        |       Name
            +------ CourseID
            |
            |
            |       Video
            |       --------
            |       VideoID
            |       Name
            |       Length
            +------ CourseID
            |
            |
            |       Resource
            |       --------
            |       ResourceID
            |       Name
            +------ CourseID
In this form, a Course can have many themes, many videos, and many resources; but each theme, video, and resource can have only one course.
However, I don't think that's how you want it.
I would lean more towards
        Course              Theme
        --------            --------
+---->  CourseId       +--> ThemeId
|       Name           |    Name
|       ThemeId -------+
|
|
|       Video
|       --------
|       VideoID
|       Name
|       Length
+------ CourseID
|
|
|       Resource
|       --------
|       ResourceID
|       Name
+------ CourseID
This allows a course to have only one theme, but many videos and resources. This allows the themes to have more than one course.
But it still doesn't quite fit the bill...
This one allows many courses to share the same theme, as well as have more than one theme:
        Course           Course_Theme          Theme
        --------         ------------          --------
+---->  CourseId <------ CourseId         +--> ThemeId
|       Name             ThemeId ---------+    Name
|       ThemeId
|
|
|       Video
|       --------
|       VideoID
|       Name
|       Length
+------ CourseID
|
|
|       Resource
|       --------
|       ResourceID
|       Name
+------ CourseID
As this stands now, each course can have many themes, videos, and resources.
Each theme can have many courses.
Each video and resource belongs to a course (i.e., can have only one course)
If a video or resource can be for more than one course, then you'll have to expand it just as I did with themes.
As per comment, everything is many to many. Notice I don't have any direct relations between themes and videos nor themes and resources. I don't think they will be necessary; you should be able to pick up what you need going through courses.
        Course          Course_Theme              Theme
        --------        ------------              --------
+---->  CourseId <----- CourseId
|       Name            ThemeId ----------------> ThemeId
|                                                 Name
|
|                       Course_Video              Video
|                       ------------              --------
+---------------------- CourseId
|                       VideoId ----------------> VideoId
|                                                 Name
|                                                 Length
|
|                       Course_Resource           Resource
|                       ---------------           --------
+---------------------- CourseId
                        ResourceId -------------> ResourceId
                                                  Name
                                                  Url, etc.
Now for the queries. Although it is possible to use aggregate functions along with group by, I think it makes far more sense to keep it simple and just pull things out one at a time.
Themes per course
SELECT T.*
FROM COURSE C
INNER JOIN COURSE_THEME CT ON CT.COURSEID=C.COURSEID
INNER JOIN THEME T ON CT.THEMEID=T.THEMEID
WHERE {insert your search conditions on course}
or, if you know CourseId:
SELECT T.*
FROM THEME T
INNER JOIN COURSE_THEME CT ON T.THEMEID = CT.THEMEID
WHERE CT.COURSEID = ?
likewise,
Videos per course
SELECT V.*
FROM COURSE C
INNER JOIN COURSE_VIDEO CV ON CV.COURSEID=C.COURSEID
INNER JOIN VIDEO V ON CV.VIDEOID=V.VIDEOID
WHERE {insert your search conditions on course}
or, if you know the CourseId:
SELECT V.*
FROM VIDEO V
INNER JOIN COURSE_VIDEO CV ON CV.VIDEOID = V.VIDEOID
WHERE CV.COURSEID = ?
to select the sum of the video lengths per course,
SELECT SUM(V.LENGTH) AS TOTAL
FROM VIDEO V
INNER JOIN COURSE_VIDEO CV ON CV.VIDEOID = V.VIDEOID
WHERE CV.COURSEID = ?
GROUP BY CV.COURSEID
Now, the tricky part is videos per theme. I am making an assumption here: the set of videos per theme is the same as the set of videos per course per theme.
The long way around:
SELECT V.*
FROM VIDEO V
INNER JOIN COURSE_VIDEO CV ON V.VIDEOID = CV.VIDEOID
INNER JOIN COURSE C ON C.COURSEID = CV.COURSEID
INNER JOIN COURSE_THEME CT ON C.COURSEID = CT.COURSEID
INNER JOIN THEME T ON CT.THEMEID = T.THEMEID
WHERE T.THEMEID = ?
Blech. You can cut out the middlemen:
SELECT V.*
FROM VIDEO V
INNER JOIN COURSE_VIDEO CV ON V.VIDEOID = CV.VIDEOID
INNER JOIN COURSE_THEME CT ON CV.COURSEID = CT.COURSEID
WHERE CT.THEMEID = ?
When you have your tables normalized, you can get any piece of information from whatever starting point you choose. FWIW, your example is a fairly complicated one since everything is many to many relations.
Update
I originally had courses as the root, but things don't change much when themes are the root:
        Theme           Course_Theme              Course
        --------        ------------              --------
+---->  ThemeId <------ ThemeId
|       Name            CourseId ---------------> CourseId
|                                                 Name
|
|                       Theme_Video               Video
|                       -----------               --------
+---------------------- ThemeId
|                       VideoId ----------------> VideoId
|                                                 Name
|                                                 Length
|
|                       Theme_Resource            Resource
|                       --------------            --------
+---------------------- ThemeId
                        ResourceId -------------> ResourceId
                                                  Name
                                                  Url, etc.
In this configuration, courses have videos and resources through ThemeId, i.e.:
SELECT V.*
FROM COURSE_THEME CT
INNER JOIN THEME_VIDEO TV ON TV.THEMEID = CT.THEMEID
INNER JOIN VIDEO V ON V.VIDEOID = TV.VIDEOID
WHERE CT.COURSEID = ?
Table Structure
Make the tables as shown in the image and JSON-encode/decode the values at input/output time. The query can then read the total time straight from the table.
Suppose I had these 4 tables, linked by various foreign key relationships (e.g. an area must belong to a location, a shop must belong to an area, an item price must belong to a shop, etc.)
----------------------------------
|Location Name | Location ID |
| | |
----------------------------------
-------------------------------------------------
|Area Name | Area ID | Location ID |
| | | |
-------------------------------------------------
-------------------------------------------------
| Shop Name | Shop ID | Area ID |
| | | |
-------------------------------------------------
----------------------------------
| Item Price | Shop ID |
| | |
----------------------------------
And I wanted the sum of 'Item Price' that belonged to a specific location id. So all the areas and shops item price total for location id 'x'.
One way I found to do this is to join all the tables for one location and get the amount eg:
SELECT SUM(items.item_price) FROM
items
left join shops ON (items.shop_id = shops.shop_id)
left join areas ON (shops.area_id = areas.area_id)
left join locations ON (areas.location_id = locations.location_id)
WHERE locations.location_id = 4;
However is this the best way to do this since it involves retrieving the full tree of the data and filtering it out? Would there be a better way if there are a million rows or is this the best way?
You can try a subquery:
SELECT SUM(items.item_price) FROM
items
join shops ON (items.shop_id = shops.shop_id)
join (select area_id from areas where location_id = 4) as ar ON (shops.area_id = ar.area_id)
If you define the right indexes, then the query does not read all the millions of rows for each table.
Think about a telephone book and how you look up a name. Do you read the whole book cover to cover looking for the name? No, you take advantage of the fact that the book is sorted by lastname, firstname and you go directly to the name. It takes only a few tries to find the right page. In fact, on average it takes about log2(N) tries for a book with N names in it.
The same kind of search happens for each join. If you have indexes, each comparison expression uses a similar lookup to find matching rows in the joined table. It's pretty fast.
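To put a number on the telephone-book analogy, here is a small Python sketch that counts the "tries" of a plain binary search over a sorted list. It is only an analogy: real database indexes are usually B-trees rather than sorted arrays, and the helper name lookup_steps is made up.

```python
import math


def lookup_steps(sorted_names, target):
    """Binary search that returns how many entries were inspected."""
    lo, hi = 0, len(sorted_names) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_names[mid] == target:
            return steps
        if sorted_names[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


# A "phone book" with one million sorted entries.
names = [f"name{i:07d}" for i in range(1_000_000)]
probes = (0, 123_456, 499_999, 999_999)
worst = max(lookup_steps(names, names[i]) for i in probes)
bound = math.floor(math.log2(len(names))) + 1
print(worst, bound)
```

For a million names, no lookup needs more than 20 inspections, compared with up to a million for a cover-to-cover read.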
But if that's not fast enough, you can also use denormalization, which in this case would mean storing all the data in one wide table with many columns.
----------------------------------------------------------------------
|Location Name | Area Name | Shop Name | Item Name | Item Price |
| | | | | |
----------------------------------------------------------------------
The advantage of denormalization is that it avoids certain joins. It stores the row just like one of the rows you'd get from the result set of your example joined SQL query. You just read one row from the table and you have all the information you need.
The disadvantage of denormalization is the redundant storage of data. Presumably each shop has many items. But each item is stored on a row of its own, which means that row has to repeat the names of the shop, area, and location.
By storing those data repeatedly, you create an opportunity for "anomalies" like if you change the name of a given shop, but you mistakenly change it only on a few rows instead of everywhere the shop name appears. Now you have two names for the same shop, and someone else looking at the database has no way of knowing which one is correct.
In general, maintaining multiple normalized tables is preferable, because each "fact" is stored exactly once, so there can be no anomalies.
Creating indexes to help your queries is sufficient for most applications.
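As a sketch of what "the right indexes" could mean for the location query, the Python example below builds the four tables in SQLite (standing in for MySQL, which is my assumption), indexes the foreign key columns, and prints the query plan. The index names, column names with underscores, and sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, location_name TEXT);
    CREATE TABLE areas (area_id INTEGER PRIMARY KEY, area_name TEXT,
                        location_id INTEGER);
    CREATE TABLE shops (shop_id INTEGER PRIMARY KEY, shop_name TEXT,
                        area_id INTEGER);
    CREATE TABLE items (item_price INTEGER, shop_id INTEGER);
    -- One index per foreign key column used in the joins.
    CREATE INDEX idx_areas_location ON areas (location_id);
    CREATE INDEX idx_shops_area ON shops (area_id);
    CREATE INDEX idx_items_shop ON items (shop_id);
    INSERT INTO locations VALUES (4, 'North'), (5, 'South');
    INSERT INTO areas VALUES (1, 'A1', 4), (2, 'A2', 4), (3, 'A3', 5);
    INSERT INTO shops VALUES (10, 'S1', 1), (11, 'S2', 2), (12, 'S3', 3);
    INSERT INTO items VALUES (100, 10), (250, 10), (50, 11), (999, 12);
""")

QUERY = """
    SELECT SUM(i.item_price)
    FROM areas a
    JOIN shops s ON s.area_id = a.area_id
    JOIN items i ON i.shop_id = s.shop_id
    WHERE a.location_id = ?
"""

total = conn.execute(QUERY, (4,)).fetchone()[0]
# The last column of each plan row is a human-readable description.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (4,))]
print(total)
for step in plan:
    print(step)
```

In MySQL the equivalent check would be EXPLAIN on the same SELECT; the exact plan text differs between engines and versions, so treat the printed plan as something to inspect rather than rely on.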
You might like my presentation, How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
Sorry if my question seems unclear, I'll try to explain.
I have a column in a row, for example /1/3/5/8/42/239/, let's say I would like to find a similar one where there is as many corresponding "ids" as possible.
Example:
| My Column |
#1 | /1/3/7/2/4/ |
#2 | /1/5/7/2/4/ |
#3 | /1/3/6/8/4/ |
Now, by running the query on #1 I would like to get row #2 as it's the most similar. Is there any way to do it or it's just my fantasy? Thanks for your time.
EDIT:
As suggested, I'm expanding my question. This column represents the favourite artists of a user from a music site. I'm searching them like this: MyColumn LIKE '%/ID/%', and I remove an artist by replacing /ID/ with /
Since you did not provide much info about your data, I have to fill the gaps with my guesses.
So you have a users table
users table
-----------
id
name
other_stuff
And you like to store which artists are favorites of a user. So you must have an artists table
artists table
-------------
id
name
other_stuff
And to relate you can add another table called favorites
favorites table
---------------
user_id
artist_id
In that table you add a record for every artist that a user likes.
Example data
users
id | name
1 | tom
2 | john
artists
id | name
1 | michael jackson
2 | madonna
3 | deep purple
favorites
user_id | artist_id
1 | 1
1 | 3
2 | 2
To select the favorites of user tom for instance you can do
select a.name
from artists a
join favorites f on f.artist_id = a.id
join users u on f.user_id = u.id
where u.name = 'tom'
And if you add proper indexing to your table then this is really fast!
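With a favorites table, the original "most similar row" question becomes a self-join that counts shared artists. A possible sketch in Python with SQLite in place of MySQL (my assumption), loading the three rows from the question as users 1 to 3; the function name most_similar is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE favorites (
        user_id   INTEGER NOT NULL,
        artist_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, artist_id)
    )
""")
# The rows from the question: /1/3/7/2/4/, /1/5/7/2/4/, /1/3/6/8/4/
packed = {1: "/1/3/7/2/4/", 2: "/1/5/7/2/4/", 3: "/1/3/6/8/4/"}
conn.executemany(
    "INSERT INTO favorites VALUES (?, ?)",
    [(uid, int(a)) for uid, s in packed.items() for a in s.split("/") if a],
)


def most_similar(conn, user_id):
    """Return (other_user_id, shared_artist_count) for the closest user."""
    return conn.execute("""
        SELECT other.user_id, COUNT(*) AS shared
        FROM favorites AS mine
        JOIN favorites AS other
          ON other.artist_id = mine.artist_id
         AND other.user_id <> mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY shared DESC, other.user_id
        LIMIT 1
    """, (user_id,)).fetchone()


best = most_similar(conn, 1)
print(best)
```

User 2 shares four artists with user 1 (1, 7, 2, 4) while user 3 shares three, so row #2 comes out as the most similar, as the question expected.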
Problem is you're storing this in a really, really awkward way.
I'm guessing you have to deal with an arbitrary number of values. You have two options:
Store the multiple IDs in a blob or text column in JSON format. Older MySQL versions don't have JSON functions built in (native JSON support arrived in MySQL 5.7), but there are user-defined functions that will extract values for you, etc.
See: http://blog.ulf-wendel.de/2013/mysql-5-7-sql-functions-for-json-udf/
Alternatively, switch to PostgreSQL.
Add as many columns to your table as the maximum number of IDs you expect to have. So if /1/3/7/2/4/8/ is the longest entry, have 6 columns in your table. The reason this is bad: you'll have sparse columns that unnecessarily slow your tables.
I'm sure you could write some horrific regex to accomplish the task, but I caution against using complex regexes on enormous tables.
I have a list of communities. Each community has a list of members. Currently I am storing each community in a row with the member names separated by a comma. This is good for smaller immutable communities. But as the communities are growing big, let us say with 75,000 members, loading of communities is becoming slower. Also partial loading of a community (let us say random 10 members) is also not very elegant. What would be the best table structure for the communities table in this scenario? Usage of multiple tables is also not an issue if there is a reason for doing that.
Use three tables
`community`
| id | name | other_column_1 | other_column_2 ...
`user`
| id | name | other_column_1 | other_column_2 ...
`community_user`
| id (autoincrement) | community_id | user_id |
Then to get user info for all users in a community you do something like this
SELECT cu.id AS entry_id, u.id, u.name FROM `community_user` AS cu
LEFT JOIN `user` AS u
ON cu.user_id = u.id
WHERE cu.community_id = <community id>
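For the "random 10 members" case from the question, a query against the link table could look like the sketch below. It uses Python with SQLite instead of MySQL (my assumption); in MySQL the random function is RAND() rather than RANDOM(). The UNIQUE constraint and the sample data are additions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE community (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE community_user (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        community_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        UNIQUE (community_id, user_id)   -- also serves as a lookup index
    );
    INSERT INTO community VALUES (1, 'big community');
""")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 1001)])
conn.executemany(
    "INSERT INTO community_user (community_id, user_id) VALUES (1, ?)",
    [(i,) for i in range(1, 1001)],
)


def sample_members(conn, community_id, limit=10):
    """Load a random handful of members instead of the whole community."""
    return conn.execute("""
        SELECT u.id, u.name
        FROM community_user AS cu
        JOIN user AS u ON u.id = cu.user_id
        WHERE cu.community_id = ?
        ORDER BY RANDOM()
        LIMIT ?
    """, (community_id, limit)).fetchall()


sample = sample_members(conn, 1)
print(sample)
```

Ordering by a random value still has to look at every member row of that community, so for very large communities a different sampling strategy may be worth considering; the point here is only that one row per membership makes partial loading a plain LIMIT query.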