I'm in the process of designing a new database for a project at work. I want to create a table that stores Assignments for a digital classroom. Each Assignment can be one of 2 categories: "Individual" or "Group".
The first implementation that comes to mind is the following:
CREATE TABLE `assignments` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL,
`category` varchar(10) NOT NULL DEFAULT 'individual',
PRIMARY KEY (`id`),
KEY `category_index` (`category`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I would then select all assignments of a given category with:
SELECT title FROM assignments WHERE category = 'individual';
However, because we've had performance issues in the past, I'm trying to optimize the design as much as possible. As such, I'm wondering whether storing the category as a VARCHAR is a good idea, considering the table will get quite large. Would an index on an INT perform better than one on a VARCHAR?
Aside from just performance, I'm also curious what would be considered a good solution from a design-perspective. Suggestions?
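For comparison, here's a sketch of the two alternatives I'm weighing (table and column names are illustrative, not final):

-- Variant A: an ENUM keeps the readable literal in queries but is stored internally as a 1-byte value.
CREATE TABLE `assignments_enum` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL,
`category` enum('individual','group') NOT NULL DEFAULT 'individual',
PRIMARY KEY (`id`),
KEY `category_index` (`category`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Variant B: a TINYINT pointing at a small lookup table; queries then filter on a numeric id.
CREATE TABLE `assignment_categories` (
`id` tinyint unsigned NOT NULL,
`name` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- assignments would then carry `category_id` tinyint unsigned NOT NULL,
-- and the SELECT would filter on category_id = 1 instead of a string literal.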
I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (on the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided EXPLAIN outputs for this; unfortunately, I don't have them right now. I'll try to add them to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event. This index is composed of the fields in the JOINs (the equivalent of a WHERE?), then the GROUP BY, then the aggregate (MIN) parts.
This is just an idea, because we currently don't have that kind of data volume to test against, so we're trying our best guess first.
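Expressed as DDL, the quick fix I have in mind would be something like this (a sketch; the index name is arbitrary):

ALTER TABLE `event`
ADD INDEX `JOIN_GRP_IDX` (`pat_id`, `exe_id`, `full_session_id`, `date`);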
My questions are:
Could the index above help performance? (The effect is quite confusing, because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table versus "Separate Join clause in a Composite Index", where the latter suggests that a composite index on joins won't work and the former that it will.)
Does this path (adding indices) show any promise? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because it seems there won't be ambiguity in our data. Specifically, for each full_session_id, there is only one event_exe.user_id returned (that's just business logic in the application).
So, what do you think about my two questions?
I recently started a project where I want to store apps in a database. Every app has additional information that comes in a 1:1, 1:n, or n:m relationship. Though I know how to store such relationships, I had some trouble modeling the developer(s)/publisher(s) for each app.
The situation:
several thousand apps
each app has its own id
several thousand companies
each company (developer/publisher) has its own id
each app can have 0, 1 or multiple developers
each app can have 0, 1 or multiple publishers
each developer can have 1 or multiple apps
each publisher can have 1 or multiple apps
It's pretty obvious that this is a many-to-many relationship and thus requires a junction table. Unfortunately, there are at least two viable options.
company
CREATE TABLE `company` (
`id` smallint(5) UNSIGNED NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
(I merged developers and publishers in this table, because a developer can also be a publisher and vice versa. I think this is better than having redundancy in two separate tables, isn't it?)
Option 1:
The first option would be to create two separate tables.
app_developer
CREATE TABLE `app_developer` (
`id` mediumint(8) UNSIGNED NOT NULL,
`app_id` mediumint(8) UNSIGNED NOT NULL,
`company_id` mediumint(8) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
app_publisher
CREATE TABLE `app_publisher` (
`id` mediumint(8) UNSIGNED NOT NULL,
`app_id` mediumint(8) UNSIGNED NOT NULL,
`company_id` mediumint(8) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Option 2:
The second option would be to create a single table and add flags (0/1) for each app/company combination.
CREATE TABLE `app_company_rel` (
`id` mediumint(8) UNSIGNED NOT NULL,
`app_id` mediumint(8) UNSIGNED NOT NULL,
`company_id` mediumint(8) UNSIGNED NOT NULL,
`developer` tinyint(1) UNSIGNED NOT NULL,
`publisher` tinyint(1) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I don't know whether there will be a requirement to search for all apps from a specific developer/publisher in the future, or whether it's just additional information without further purpose.
Which option would be better (in terms of consistency, redundancy, performance) or is there no considerable difference?
Your first option will be the more efficient one. Storing the relationships in two different tables is actually good: create the two tables and use app_id as a foreign key in each. Keeping them separate makes the data very clear, and retrieval will also be easy and fast. If you have any doubts, let me know and I will explain further.
Option 3: Like #2, but with an ENUM or SET for dev and pub.
I would consider either Option 1 or Option 3. But I would not include an id for a simple many-to-many mapping table; it slows things down.
More discussion and tips on how to write an optimal many-to-many table.
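A sketch of such a mapping table, following the question's column names (the ENUM and the reverse index are my additions; a SET would work similarly):

CREATE TABLE `app_company_rel` (
`app_id` mediumint(8) UNSIGNED NOT NULL,
`company_id` mediumint(8) UNSIGNED NOT NULL,
`relation` enum('developer','publisher') NOT NULL,
PRIMARY KEY (`app_id`, `company_id`, `relation`),
KEY `company_app` (`company_id`, `app_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;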
Your second option is headed in the right direction. The table below establishes not only the relationship between a company and a project, but also the type of relationship -- developer or publisher.
create table ProjCompany(
ProjID int not null references Projects( ID ),
CompanyID int not null references Company( ID ),
TypeID char( 1 ) not null references Types( ID ), -- 'D' or 'P'
constraint OK_ProjCompany primary key( ProjID, CompanyID, TypeID )
);
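The Types lookup it references could be as simple as this (a sketch):

create table Types(
TypeID char( 1 ) not null primary key, -- 'D' = developer, 'P' = publisher
Name varchar( 32 ) not null
);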
A project can have each company listed as a developer only once but the same company could also show up as a publisher. A company could be a developer and/or publisher for any number of projects.
If any table needed a FK reference to a particular developer of a particular project, it would reference this table with the project id, the company id and the flag for developer. If that company was not defined as a developer for that project, the reference would be rejected.
Further, I would recommend a view that would show each project and their developers and a view that would show each project and their publishers. This would come in handy for portions of code that would be working only with developers or only with publishers.
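For example (a sketch; the view names are illustrative):

create view ProjDevelopers as
select ProjID, CompanyID
from ProjCompany
where TypeID = 'D';

create view ProjPublishers as
select ProjID, CompanyID
from ProjCompany
where TypeID = 'P';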
Researching hierarchical data persistence led me to closure tables, and I pieced together this comment structure from the culmination of that research.
Queries for creating new nodes in the closure table were easy enough for me to grasp and fetching data for descendants via a JOIN on the closure table is simple enough.
However, I would like to expand upon that and get results back sorted, and limited in the number of parents/children returned, down through a depth of x.
I'm trying to keep things timely/efficient (I expect the comments table to get very large) by making use of foreign keys and indexes. I am shooting for an all-in-one query that can do what I ask in the title, but I'm not opposed to breaking it up to increase speed/efficiency.
Current table structures:
CREATE TABLE `comments` (
`comment_id` int(11) UNSIGNED PRIMARY KEY,
`reply_to` int(11) UNSIGNED NOT NULL DEFAULT '0',
`user_id` int(11) UNSIGNED NOT NULL,
`comment_time` int(11) NOT NULL,
`comment` mediumtext NOT NULL,
FOREIGN KEY (`user_id`) REFERENCES users(`user_id`)
) ENGINE=InnoDB;
CREATE TABLE `comments_closure`(
`ancestor_id` int(11) UNSIGNED NOT NULL,
`descendant_id` int(11) UNSIGNED NOT NULL,
`length` tinyint(3) UNSIGNED NOT NULL DEFAULT '0',
PRIMARY KEY(`ancestor_id`, `descendant_id`),
KEY `tree_adl`(`ancestor_id`, `descendant_id`, `length`),
KEY `tree_dl`(`descendant_id`, `length`),
FOREIGN KEY (`ancestor_id`) REFERENCES comments(`comment_id`),
FOREIGN KEY (`descendant_id`) REFERENCES comments(`comment_id`)
) ENGINE=InnoDB;
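For reference, the new-node insert mentioned above follows the usual closure-table pattern (a sketch; :new_id and :parent_id are placeholders bound by the application):

-- One row per ancestor of the parent, plus the node's self-reference:
INSERT INTO comments_closure (ancestor_id, descendant_id, length)
SELECT ancestor_id, :new_id, length + 1
FROM comments_closure
WHERE descendant_id = :parent_id
UNION ALL
SELECT :new_id, :new_id, 0;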
A clearer summary of what I'm trying to do: fetch 20 comments that share an ancestor_id, sorted by time, while also fetching each one's descendants up to 2 lengths deeper (keeping those limited to a much smaller number, say 2 per comment), also sorted by time.
I'm not looking to always sort by time, however; I would also like to be able to fetch results sorted by their comment_id. Is it possible to do all this in a single query? I'm not quite sure where to begin.
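The closest I've come to picturing it is a two-step approach (a sketch only, with :root_id as a placeholder; I'd much rather have it in one query):

-- Step 1: the 20 newest direct replies under a given ancestor.
SELECT c.*
FROM comments c
JOIN comments_closure cc ON cc.descendant_id = c.comment_id
WHERE cc.ancestor_id = :root_id
AND cc.length = 1
ORDER BY c.comment_time DESC
LIMIT 20;

-- Step 2: their descendants down to 2 lengths deeper; capping at 2 per
-- parent would still have to happen in application code on MySQL 5.x.
SELECT cc.ancestor_id AS parent_id, c.*
FROM comments c
JOIN comments_closure cc ON cc.descendant_id = c.comment_id
WHERE cc.ancestor_id IN (/* the 20 ids from step 1 */)
AND cc.length BETWEEN 1 AND 2
ORDER BY cc.ancestor_id, c.comment_time DESC;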
This is a pretty basic question, but I'm confused by what I'm reading in various places. I have a simple table that doesn't contain a huge amount of data (less than 500 rows for any given DB is typical). A typical query against this table looks like:
select system_fields.name from system_fields where system_fields.form_id=? and system_fields.field_id=?
My question is, should I have a separate index for form_id and one for field_id, or should I be creating an index on a combination of those two fields? I've never really done anything with multi-column indexes before.
CREATE TABLE IF NOT EXISTS `system_fields` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`field_id` int(11) NOT NULL,
`form_id` int(11) NOT NULL,
`name` varchar(50) NOT NULL,
`reference_field_id` varchar(1000) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `field_id` (`field_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=293 ;
If you are always going to query by these two fields, then add a multi-column index.
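For example (a sketch; the index name is arbitrary):

ALTER TABLE `system_fields`
ADD INDEX `form_field_idx` (`form_id`, `field_id`);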
I'll also point out that if you're going to have < 500 rows in the table, your index may not even get used. Any performance difference with or without an index on a 500-row table will be negligible.
Here's a bit more (good) reading:
https://www.percona.com/blog/2014/01/03/multiple-column-index-vs-multiple-indexes-with-mysql-56/
I'm trying to implement a way to track changes to a table named user and to another named report_to. Below are their definitions:
CREATE TABLE `user`
(
`agent_eid` int(11) NOT NULL,
`agent_id` int(11) DEFAULT NULL,
`agent_pipkin_id` int(11) DEFAULT NULL,
`first_name` varchar(45) NOT NULL,
`last_name` varchar(45) NOT NULL,
`team_id` int(11) NOT NULL,
`hire_date` date NOT NULL,
`active` bit(1) NOT NULL,
`agent_id_req` bit(1) NOT NULL,
`agent_eid_req` bit(1) NOT NULL,
`agent_pipkin_req` bit(1) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
UNIQUE KEY `agent_id_UNIQUE` (`agent_id`),
UNIQUE KEY `agent_pipkin_id_UNIQUE` (`agent_pipkin_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `report_to`
(
`agent_eid` int(11) NOT NULL,
`report_to_eid` int(11) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
KEY `report_to_report_fk_idx` (`report_to_eid`),
CONSTRAINT `report_to_agent_fk` FOREIGN KEY (`agent_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `report_to_report_fk` FOREIGN KEY (`report_to_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8
What can change and needs to be tracked are user.team_id, user.active, and report_to.report_to_eid. What I currently have implemented is a table, populated via an update trigger on user, that tracks team changes. That table is defined as:
CREATE TABLE `user_team_changes`
(
`agent_id` int(11) NOT NULL,
`date_changed` date NOT NULL,
`old_team_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
PRIMARY KEY (`agent_id`,`date_changed`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
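For illustration, such an update trigger would be shaped roughly like this (a sketch, not the exact production trigger; the begin_date handling and the agent_id mapping are guesses):

DELIMITER $$
CREATE TRIGGER `user_team_change_trg`
AFTER UPDATE ON `user`
FOR EACH ROW
BEGIN
    -- Log only when the team actually changed; assuming agent_id here
    -- corresponds to user.agent_id.
    IF NEW.team_id <> OLD.team_id THEN
        INSERT INTO user_team_changes (agent_id, date_changed, old_team_id, begin_date)
        VALUES (NEW.agent_id, CURDATE(), OLD.team_id, CURDATE());
    END IF;
END$$
DELIMITER ;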
This works fine for just tracking team changes. I'm able to use joins and a union to populate a history view that tracks that change over time for individual users. The complexity rises when I try to implement tracking for the other two change types.
I have thought about creating additional tables similar to the one tracking changes for teams, but I worry about performance hits due to the joins that will be required.
Another way I have considered is creating a table modeled on a view I have that details the current user state (it joins all the necessary user data together from 4 tables), then inserting a record on each update, with a valid-until date field added. My concern with that is the amount of space this could take.
We will be using the user change history quite a bit as we will be running YTD, MTD, PMTD and time interval reports with it on an almost daily basis.
Out of the two options I am considering, which would be the best for my given situation?
The options you've presented:
using triggers to populate transaction-log tables.
including a new table with effective-date columns in the schema and tracking changes by inserting new rows.
Either one of these will work. You can add logging triggers to other tables without causing any trouble.
What distinguishes these two choices? The first one is straightforward, once you get your triggers debugged.
The second choice seems to me like it will create denormalized, redundant data. That is never good. I would opt not to do that. It is possible, with judicious combinations of views and effective-date columns, to create history tables that are viewable as the present state of the system. To learn about this, look at Prof. R. T. Snodgrass's excellent book Developing Time-Oriented Database Applications in SQL. http://www.cs.arizona.edu/~rts/publications.html If you have time to do an excellent engineering (over-engineering?) job on this project, you might consider this approach.
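As a sketch of what the effective-date approach could look like here (table and column names are hypothetical; the sentinel end date keeps the columns NOT NULL, per the advice below):

-- Each row carries a validity interval; the sentinel end date marks the current row.
CREATE TABLE `user_history` (
`agent_eid` int(11) NOT NULL,
`team_id` int(11) NOT NULL,
`active` bit(1) NOT NULL,
`report_to_eid` int(11) DEFAULT NULL,
`valid_from` date NOT NULL,
`valid_until` date NOT NULL DEFAULT '9999-12-31',
PRIMARY KEY (`agent_eid`, `valid_from`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- A view presenting the present state of the system:
CREATE VIEW `user_current` AS
SELECT agent_eid, team_id, active, report_to_eid
FROM user_history
WHERE valid_until = '9999-12-31';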
The data volume you've mentioned will not cause intractable performance problems on any modern server hardware platform. If you do get slowdowns on JOIN operations, it's almost certain that the addition of appropriate indexes will completely fix them, as long as you declare all your DATE, DATETIME, and TIMESTAMP fields NOT NULL. (NULL values can mess up indexing and searching).
Hope this helps.