SQL Design Decision: Should I merge these tables?

I'm attempting to design a small database for a customer. My customer has an organization that works with public and private schools; for every school that's involved, there's an implementation (a chapter) at each school.
To design this, I've put together two tables; one for schools and one for chapters. I'm not sure, however, if I should merge the two together. The tables are as follows:
mysql> describe chapters;
+--------------------+------------------+------+-----+---------+----------------+
| Field              | Type             | Null | Key | Default | Extra          |
+--------------------+------------------+------+-----+---------+----------------+
| id                 | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| school_id          | int(10) unsigned | NO   | MUL |         |                |
| is_active          | tinyint(1)       | NO   |     | 1       |                |
| registration_date  | date             | YES  |     | NULL    |                |
| state_registration | varchar(10)      | YES  |     | NULL    |                |
| renewal_date       | date             | YES  |     | NULL    |                |
| population         | int(10) unsigned | YES  |     | NULL    |                |
+--------------------+------------------+------+-----+---------+----------------+
7 rows in set (0.01 sec)
mysql> describe schools;
+----------------------+------------------------------------+------+-----+---------+----------------+
| Field                | Type                               | Null | Key | Default | Extra          |
+----------------------+------------------------------------+------+-----+---------+----------------+
| id                   | int(10) unsigned                   | NO   | PRI | NULL    | auto_increment |
| full_name            | varchar(255)                       | NO   | MUL |         |                |
| classification       | enum('high','middle','elementary') | NO   |     |         |                |
| address              | varchar(255)                       | NO   |     |         |                |
| city                 | varchar(40)                        | NO   |     |         |                |
| state                | char(2)                            | NO   |     |         |                |
| zip                  | int(5) unsigned                    | NO   |     |         |                |
| principal_first_name | varchar(20)                        | YES  |     | NULL    |                |
| principal_last_name  | varchar(20)                        | YES  |     | NULL    |                |
| principal_email      | varchar(20)                        | YES  |     | NULL    |                |
| website              | varchar(20)                        | YES  |     | NULL    |                |
| population           | int(10) unsigned                   | YES  |     | NULL    |                |
+----------------------+------------------------------------+------+-----+---------+----------------+
12 rows in set (0.01 sec)
(Note that these tables are incomplete - I haven't implemented foreign keys yet. Also, please ignore the varchar sizes for some of the fields, they'll be changing.)
So far, the pros of keeping them separate are:
Separate queries of schools and chapters are easier. I don't know if it's necessary at the moment, but it's nice to be able to do.
I can make a chapter inactive without directly affecting the school information.
General separation of data - the fields in "chapters" are directly related to the chapter itself, not the school in which it exists. (I like the organization - it makes more sense to me. Also follows the "nothing but the key" mantra.)
If possible, we can collect school data without having a chapter associated with it, which may make sense if we eventually want people to select a school and autopopulate the data.
And the cons:
Separate IDs for schools and chapters. As far as I know, there will only ever be a one-to-one relationship between the two, so doing this might introduce more complexity that could lead to errors down the line (like importing data from a spreadsheet, which is unfortunately something I'll be doing a lot of).
If there's a one-to-one ratio, and the IDs are auto_increment fields, I'm guessing that the chapter_id and school_id will end up being the same - so why not just put them in a single table?
From what I understand, the chapters aren't really identifiable on their own - they're bound to a school, and as such should be a subset of a school. Should they really be separate objects in a table?
Right now, I'm leaning towards keeping them as two separate tables; it seems as though the pros outweigh the cons, but I want to make sure that I'm not creating a situation that could cause problems down the line. I've been in touch with my customer and I'm trying to get more details about the data they store and what they want to do with it, which I think will really help. However, I'd like some opinions from the well-informed folks on here; is there anything I haven't thought of? The bottom line here is just that I want to do things right the first time around.

I think they should be kept separate. But, you can make the chapter a subtype of a school (and the school the supertype) and use the same ID. Elsewhere in the database where you use SchoolID you mean the school and where you use ChapterID you mean the chapter.
CREATE TABLE School (
    SchoolID int unsigned NOT NULL AUTO_INCREMENT,
    CONSTRAINT PK_School PRIMARY KEY (SchoolID)
);
CREATE TABLE Chapter (
    ChapterID int unsigned NOT NULL,
    CONSTRAINT PK_Chapter PRIMARY KEY (ChapterID),
    CONSTRAINT FK_Chapter_School FOREIGN KEY (ChapterID) REFERENCES School (SchoolID)
);
Now you can't have a chapter unless there's a school first. If such a time occurred that you had to allow multiple chapters per school, you would recreate the Chapter table with ChapterID as identity/auto-increment, add a SchoolID column populated with the same value and put the FK on this one to School, and continue as before, only inserting the ID to SchoolID instead of ChapterID. If MySQL supports inserting explicit values to an autoincrement column, then making it SchoolID autoincrement ahead of time could save you trouble later (unless switching a regular column to autoincrement is supported in which case no issues there).
Additional benefits of keeping them separate:
You can make foreign key relationships directly with SchoolID or ChapterID so that the data you're storing is always correct (for example, if no chapter exists yet you can't store related data for such a thing until it is created).
Querying each table separately will perform better as the rows don't contain extraneous information.
A school can be created with certain required columns, but the chapter left uncreated (temporarily). Then, when it is created, you can have some NOT NULL columns in it as well.
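To make the shared-ID idea concrete, here is a minimal sketch. It is written in Python against an in-memory SQLite database purely so it can be run without a server; SQLite is only a stand-in for MySQL, and the FullName and IsActive columns are trimmed-down placeholders rather than the full schema from the question. It shows that a chapter cannot be inserted for a school that does not exist, while a school without a chapter is fine.

```python
import sqlite3

# SQLite stands in for MySQL here; the DDL is trimmed to the key columns.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE School (
    SchoolID INTEGER PRIMARY KEY,
    FullName TEXT NOT NULL                 -- placeholder column
);
CREATE TABLE Chapter (
    ChapterID INTEGER PRIMARY KEY,         -- same value as School.SchoolID
    IsActive  INTEGER NOT NULL DEFAULT 1,  -- placeholder column
    FOREIGN KEY (ChapterID) REFERENCES School (SchoolID)
);
""")

conn.execute("INSERT INTO School (SchoolID, FullName) VALUES (1, 'Example High')")
conn.execute("INSERT INTO School (SchoolID, FullName) VALUES (2, 'Sample Middle')")
conn.execute("INSERT INTO Chapter (ChapterID) VALUES (1)")  # school 1 gets a chapter

# A chapter for a school that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO Chapter (ChapterID) VALUES (99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Schools without a chapter are still visible through a LEFT JOIN.
rows = conn.execute("""
    SELECT s.SchoolID, s.FullName, c.ChapterID IS NOT NULL AS has_chapter
    FROM School AS s
    LEFT JOIN Chapter AS c ON c.ChapterID = s.SchoolID
    ORDER BY s.SchoolID
""").fetchall()
print(orphan_rejected, rows)
```

Because ChapterID carries the school's ID, a spreadsheet import only needs one identifier per row, which addresses the "separate IDs" concern in the question.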

Keep them separate.
They may be 1-1 currently; however, these are clearly separate concepts.
Will they eventually want to input schools which do not have chapters? Perhaps as part of a sales lead system?
Can there really only be one chapter per school, or just one active chapter? What about across time? Is it possible they will request a report with all chapters in the past 10 years at x school?

You said the links will always be 1 to 1, but does a school always have a chapter, and can it change chapters? If so, then keeping chapters separate is a good idea.

Another reason to keep them separate is if the amount of information about the two entities combined would make the length of the records longer than the database backend can handle. One-to-one tables are often built to keep the amount of data that needs to be stored in a record down to an appropriate size.
Further, is the requirement a firm 1-1, or does it have the potential to be 1-many? If the second, make it a separate table now. Is there the potential to have schools without chapters? Again, I'd keep them separate.
And how are you intending to query this data? If you will generally need the data about both the chapter and the school in the same queries, then you might put them in one table, provided you are sure there is no possibility of it turning into a 1-many relationship. However, a proper join with the join fields indexed should be fast anyway.
I tend to see these as separate entities and would keep them separate unless there was a critical performance problem that would lead to putting them together. I think that having separate entities in separate tables from the start tends to be less risky than putting them together. And performance would normally be perfectly acceptable as long as the indexing is correct, and may even be better if you don't normally need to query data from both tables all the time.

Related

How do I structure my "products" table correctly with over 250,000 rows?

The Problem
I landed a small gig to develop an online quoting system for an electronic distributor. He has roughly a half million parts - one little screw is considered a part, one little led, etc. So there are a LOT of parts.
One Important Note: This is only a RFQ ( Request for Quote ). There are no prices client-side, or totals, or anything to do with money. Just collecting a list of part numbers to send to my client.
I had to collect the part data from multiple sources (vendor website, scanned paper catalog, Excel spreadsheets, CSV files, and even a few JSON files). It was exhausting, but I got it done.
Results
Confusing at first. I had dozens of product categories, and some products had so many attributes that were not common to any other products. I could see this project getting very complicated, and given the fact I bid this job at $900 even, I had to simplify this somehow.
This is what I came up with, and received client approval.
Current Columns
+--------------------------+--------------+------+-----+---------+-------+
| Field                    | Type         | Null | Key | Default | Extra |
+--------------------------+--------------+------+-----+---------+-------+
| Datasheets               | varchar(128) | YES  |     | NULL    |       |
| Image                    | varchar(85)  | YES  |     | NULL    |       |
| DigiKey_Part_Number      | varchar(46)  | YES  |     | NULL    |       |
| Manufacturer_Part_Number | varchar(47)  | YES  |     | NULL    |       |
| Manufacturer             | varchar(49)  | YES  |     | NULL    |       |
| Description              | varchar(34)  | YES  |     | NULL    |       |
| Quantity_Available       | int(11)      | YES  |     | NULL    |       |
| Minimum_Quantity         | int(11)      | YES  |     | NULL    |       |
+--------------------------+--------------+------+-----+---------+-------+
so all products will fit this page template (menu on bottom is error in screenshot):
Autocomplete Off The Table?
Early on in the design, I implemented a nice autocomplete feature:
But given the number of products in the table, is this even practical anymore?
FINAL PRODUCT COUNT: 223,347
What changes do I need to make to PRODUCTS table so that querying the table will not take forever?
These are the only queries the app will be making ( not sure if this info will help in your solution advice )...
Get all products by category:
Select * from products where category = 'semiconductors'
Get single product:
Select * from products where Manufacturer_Part_Number = '12345'
Get product count by category
I think those three actually cover everything I need to do. Maybe a couple more, but not many.
In closing...
Is there a way to "index" this table with 223000 records where searching by one or more columns can be done efficiently?
I am very new to database design, and know I do need to index SOMETHING, but ... WHAT???
Thank you for taking the time to look at this post.
Regards,
John
Listing the queries is mandatory to answering your question. Thanks for including them.
INDEX(category)
INDEX(Manufacturer_Part_Number)
But I suggest your second query should include Manufacturer, too. Then this would be better:
INDEX(Manufacturer, Manufacturer_Part_Number)
Everything NULL? Seems unlikely.
(I've done jobs like yours; I can't imagine bidding only $900 for all that scraping.)
What will you do when there are a thousand items in a single category or manufacturer? A UI with a thousand-item list sucks.
For how to handle "so many attributes", I recommend http://mysql.rjweb.org/doc.php/eav (I should charge you $899 for the research that went into that document. Just kidding.)
Don't they need other lookups, like "Flash drive", which need to match "FLASH DRV"?
223K rows -- no problem. The VARCHARs seem to be too short; were they based on the data?
And the table needs a PRIMARY KEY.
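As a rough illustration of the indexes suggested above, the sketch below builds a cut-down products table in an in-memory SQLite database (a stand-in for MySQL, used only so the example runs anywhere) and asks the planner how it would execute the two lookups. The id and category columns and the index names idx_category and idx_mfr_part are assumptions made for the sketch; the question's DESCRIBE output shows neither a primary key nor a category column. In MySQL you would check the same thing with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id INTEGER PRIMARY KEY,                 -- assumed surrogate key
    category TEXT NOT NULL,                 -- assumed column, used by the queries
    Manufacturer TEXT NOT NULL,
    Manufacturer_Part_Number TEXT NOT NULL,
    Description TEXT
);
CREATE INDEX idx_category ON products (category);
CREATE INDEX idx_mfr_part ON products (Manufacturer, Manufacturer_Part_Number);
""")

def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the detail text

by_category = plan(
    "SELECT * FROM products WHERE category = ?", ("semiconductors",)
)
by_part = plan(
    "SELECT * FROM products "
    "WHERE Manufacturer = ? AND Manufacturer_Part_Number = ?",
    ("Acme", "12345"),
)
print(by_category)  # should mention idx_category
print(by_part)      # should mention idx_mfr_part
```

If the plan names the index rather than a full scan of products, the lookup does not have to touch all 223K rows.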

Exclude rows in mysql which contain a year without the NOT LIKE operator

I have a table which contains tags. Almost all tags are genres (such as action and comedy). However there are also tags such as Winter 2014 and Summer 2012. These are seasonal tags.
I want to exclude those tags from a genre listing. So how do I exclude those seasonal tags in the query?
The reason I don't want to use the NOT LIKE operator is to prevent full table scans.
This is what I currently have (in eloquent):
$genres = Tag::where('slug', 'not like', '%20%')->get()->lists('name');
Sidenote: a laravel 4 (eloquent) approach would be appreciated but not necessary.
This is my table
+---------+------------------+------+-----+---------+----------------+
| Field   | Type             | Null | Key | Default | Extra          |
+---------+------------------+------+-----+---------+----------------+
| id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| slug    | varchar(255)     | NO   | MUL | NULL    |                |
| name    | varchar(255)     | NO   |     | NULL    |                |
| suggest | tinyint(1)       | NO   |     | 0       |                |
| count   | int(10) unsigned | NO   |     | 0       |                |
+---------+------------------+------+-----+---------+----------------+
If I were you, I would add a seasonal tinyint(1) field to this table; then you could simply run:
$genres = Tag::whereSeasonal(0)->get()->lists('name');
to get tags that are not seasonal.
If you cannot do that, you could store the ids of the seasonal tags in a PHP array (or in one more table); I don't know how many tags you have or how often you add seasonal tags. Then you could get the non-seasonal tags with:
$genres = Tag::whereNotIn('id', $arrayOfSeasonalIds)->get()->lists('name');
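For what the flag-column approach looks like at the SQL level, here is a small sketch using Python and an in-memory SQLite database (a stand-in for MySQL; the sample slugs are made up). It assumes the flag is backfilled once with the same LIKE pattern the question uses, after which every genre listing filters on the flag with a plain equality test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        id       INTEGER PRIMARY KEY,
        slug     TEXT NOT NULL,
        name     TEXT NOT NULL,
        seasonal INTEGER NOT NULL DEFAULT 0   -- the proposed flag column
    )
""")
conn.executemany(
    "INSERT INTO tags (slug, name) VALUES (?, ?)",
    [
        ("action", "Action"),
        ("comedy", "Comedy"),
        ("winter-2014", "Winter 2014"),   # made-up sample slugs
        ("summer-2012", "Summer 2012"),
    ],
)

# One-off backfill: the pattern scan happens once here, not on every listing.
conn.execute("UPDATE tags SET seasonal = 1 WHERE slug LIKE '%20%'")

# Everyday query: an equality test on the flag, no pattern matching.
genres = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM tags WHERE seasonal = 0 ORDER BY name"
    )
]
print(genres)
```

New seasonal tags would need the flag set when they are inserted, which is the trade-off against the NOT LIKE version.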
If you're looking for a solution to exclude years in data fields that are stored as strings (CHAR/VARCHAR), NOT LIKE is probably the best way to go about it going off of your description of the problem. If the dates you're checking are DATE/DATETIME/TIMESTAMP you can use the YEAR() MySQL function to yank the year out of the field to which you wish to compare.
If not, could you provide the output of DESCRIBE tablename for the table on which you wish to perform this action?

Table design for temporary table accessed by multiple processes and stores 1,000,000,000+ 4-column rows

I am using MySQL for temporary storage of the result of one billion or more results, where the results are calculated by processes executing in parallel.
Each result is calculated using a function [f] on the representations [r1] and [r2] of objects identified respectively by [o1] and [o2].
Currently, I use three tables to execute this process:
(1) A table mapping object identifiers to their representations:
mysql> describe v2_3282_fp;
+----------------+------+------+-----+---------+-------+
| Field          | Type | Null | Key | Default | Extra |
+----------------+------+------+-----+---------+-------+
| objid          | text | YES  |     | NULL    |       |
| representation | text | YES  |     | NULL    |       |
+----------------+------+------+-----+---------+-------+
(2) A table holding jobs that each compute process should retrieve and calculate:
mysql> describe v2_3282_job;
+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| jobid    | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| workerid | int(11)             | YES  |     | NULL    |                |
| pairid1  | text                | YES  |     | NULL    |                |
| pairid2  | text                | YES  |     | NULL    |                |
+----------+---------------------+------+-----+---------+----------------+
(3) A table holding the results of compute jobs:
mysql> describe v2_3282_res;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| resultid  | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| pairid1   | text                | YES  |     | NULL    |                |
| pairid2   | text                | YES  |     | NULL    |                |
| pairscore | double(36,18)       | YES  |     | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+
(the pairscore type is dynamically determined during execution, and not fixed to (36,18) .)
Once the representations have been registered, one process continually scans the result table for new results to transfer to an object existing in memory, and the remaining processes retrieve jobs to compute until they receive a job with a pair of identifiers signalling the end of computation.
During unit tests with 1,000,000 or so computations, this system works just fine.
However, as the demands to use this system have grown to 1,000,000,000+, I see that the system eventually gets bogged down in swapping back and forth between memory and disk.
When I check the system memory and swap space in use, system memory is completely used, but typically less than 20% of swap is used.
I have read that MySQL performance is best when entire tables can be read into memory, and resorting to disk I/O is the major bottleneck.
This seems to be the case for me as well, as running computations on my systems with 12 GB and 16 GB of RAM eventually requires more and more time between worker process cycles, though my lone system with 64 GB never seems to encounter this issue.
While the straightforward answer is, "Hey buddy, buy more RAM.", I think there is a more fundamental design issue that is causing my system to degrade as the demands grow. I know that MySQL is a well-engineered product widely used, and that database and table design consideration can greatly impact performance.
So without resorting to the brute force resolution of buying more memory, I am looking for suggestions on how to improve the engineering of the MySQL table design I came up with.
While I know the basics of MySQL table normalization and can create queries to implement my needs, I do not know much about each type of database engine, the details of indexing, and other database-specific design considerations.
The questions I have are:
(1) Would performance be any different if I split the result and job tables into smaller tables instead of single large tables? (I think not.)
(2) I currently issue a limit clause programmatically to retrieve a fixed number of results in each retrieval cycle. However, I don't know if this can be further optimized over the simple "SELECT ... FROM [result table] LIMIT start, size". (I think so.)
(3) Does it make sense to tell the worker processes to sleep between cycles in order to let MySQL "catch up"? (I think not.)
My appreciation in advance for any advice from those experienced in database and table design.
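On question (2), one commonly suggested alternative to "LIMIT start, size" is to remember the last resultid already transferred and ask only for rows after it, so the server does not have to step over all earlier rows on every cycle. The sketch below is only an illustration of that idea, written in Python against an in-memory SQLite table standing in for the result table; the table name res, the helper fetch_batch, and the batch size are assumptions, and it relies on resultid being an increasing primary key as in the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE res (
        resultid  INTEGER PRIMARY KEY AUTOINCREMENT,
        pairid1   TEXT,
        pairid2   TEXT,
        pairscore REAL
    )
""")
conn.executemany(
    "INSERT INTO res (pairid1, pairid2, pairscore) VALUES (?, ?, ?)",
    [(f"a{i}", f"b{i}", i / 10) for i in range(25)],  # 25 dummy results
)

def fetch_batch(last_seen_id, size):
    """Fetch the next `size` results after `last_seen_id` via the primary key."""
    return conn.execute(
        "SELECT resultid, pairid1, pairid2, pairscore FROM res "
        "WHERE resultid > ? ORDER BY resultid LIMIT ?",
        (last_seen_id, size),
    ).fetchall()

last_id = 0          # nothing transferred yet
batch_sizes = []
while True:
    batch = fetch_batch(last_id, 10)
    if not batch:
        break        # a real reader would wait here and poll again
    batch_sizes.append(len(batch))
    last_id = batch[-1][0]   # remember where this cycle stopped

print(batch_sizes, last_id)
```

Whether this helps with the memory pressure described above would need measuring; it only removes the cost of skipping already-read rows.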

A dynamic content solution for a flexible mysql table

I've been thinking a lot, trying to figure out how to make a flexible system that holds many values while avoiding the need to add more fields to the table in the future.
The only thing I could think of is to make a table that will look like this:
CREATE TABLE IF NOT EXISTS `form_data` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(50) NOT NULL,
`value` varchar(500) default NULL,
`form_id` int(11) NOT NULL,
PRIMARY KEY (`id`)
)
+--------+---------+----------+--------+
| id | name | value | form_id|
+--------+---------+----------+--------+
| 100 |fullname | Steve | 1 |
+--------+---------+----------+--------+
| 101 |email |ab#c.com | 1 |
+--------+---------+----------+--------+
| 102 |fullname | John | 1 |
+--------+---------+----------+--------+
| 103 |email |cd#c.com | 1 |
+--------+---------+----------+--------+
This way, I could save each value in a row, and it would be as dynamic as I'd want.
I'm aware of the bad performance of very long tables.
Now I've also figured out how to make the View(front end) of the values in a "Regular" table. Looks just like a normal table.
+--------+---------+----------+
| ID | Email |Fullname |
+--------+---------+----------+
| 1 |ab#c.com | Steve |
+--------+---------+----------+
| 2 |cd#c.com | John |
+--------+---------+----------+
Now I want to create a temporary table instead of PHP loops.
Any ideas how to make this work?
How do I create a stored procedure that will receive the form_id as parameter and will return a table like this?
Congratulations. You have re-invented the Entity-Attribute-Value model.
This model has existed for quite a while, but has proven to perform quite badly in a relational database system. You should probably not use it.
This answer makes a nice list of the pros and cons of EAV. The biggest pro is what you have discovered: it's easier to design. The biggest con is what I'm telling you here: it's worse on performance.
Since usually you design far less often than your queries run, it might be better to think a bit longer while designing, and have faster queries.
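For reference, if the name/value table is kept anyway, the "regular" table from the question is usually produced with conditional aggregation: one MAX(CASE ...) per attribute, grouped by whatever identifies a single submission. The sample rows in the question have no column tying a fullname to its email, so this sketch adds a hypothetical entry_id column for that purpose. It uses Python with an in-memory SQLite database as a stand-in for MySQL; the same SELECT could form the body of a stored procedure that takes form_id as its parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE form_data (
        id       INTEGER PRIMARY KEY,
        entry_id INTEGER NOT NULL,   -- hypothetical: groups one submission's values
        name     TEXT NOT NULL,
        value    TEXT,
        form_id  INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO form_data (id, entry_id, name, value, form_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (100, 1, "fullname", "Steve", 1),
        (101, 1, "email", "ab#c.com", 1),
        (102, 2, "fullname", "John", 1),
        (103, 2, "email", "cd#c.com", 1),
    ],
)

def pivot(form_id):
    """Turn the row-per-value layout into one row per submission."""
    return conn.execute(
        """
        SELECT entry_id,
               MAX(CASE WHEN name = 'email'    THEN value END) AS email,
               MAX(CASE WHEN name = 'fullname' THEN value END) AS fullname
        FROM form_data
        WHERE form_id = ?
        GROUP BY entry_id
        ORDER BY entry_id
        """,
        (form_id,),
    ).fetchall()

table = pivot(1)
print(table)
```

Every new attribute needs another CASE line in the query, which is part of the cost of the flexible layout.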
NoSQL is considered more dynamic.
Try the schemaless approach.
This may be a good starting point: http://www.igvita.com/2010/03/01/schema-free-mysql-vs-nosql/

Updating a sequence of rows

I have a table sites and basically a travelling salesman problem. My boss wants to select a bunch of sites out of the list, then sort them manually into a visit order. I have looked for similar questions, but they were not targeted at MySQL, and those that were didn't provide a reasonable solution for my situation. I didn't do Computer Science at university, so hopefully this is bread-and-butter stuff for some of you out there.
I would like to do something like the following pseudo code:
UPDATE sites SET run_order=0 WHERE selected='false';
UPDATE sites SET run_order=AUTO_SEQUENCE(DESC FROM 6) WHERE site_id=SEQUENCE(23,17,9,44,2,14);
The latter of those would have the same effect as:
UPDATE sites SET run_order=6 WHERE site_id=23;
UPDATE sites SET run_order=5 WHERE site_id=17;
UPDATE sites SET run_order=4 WHERE site_id=9;
UPDATE sites SET run_order=3 WHERE site_id=44;
UPDATE sites SET run_order=2 WHERE site_id=2;
UPDATE sites SET run_order=1 WHERE site_id=14;
Since I am running this via PHP, I don't want to have to issue many individual queries, even though the number of sites my boss could visit in a day is of course limited by the internal combustion engine.
My SQL table looks like this:
+---------------+----------------------+------+-----+---------+----------------+
| Field         | Type                 | Null | Key | Default | Extra          |
+---------------+----------------------+------+-----+---------+----------------+
| site_id       | int(10) unsigned     | NO   | PRI | NULL    | auto_increment |
| ...           |                      |      |     |         |                |
| selected      | enum('false','true') | NO   |     | false   |                |
| run_order     | int(10) unsigned     | NO   |     | 0       |                |
+---------------+----------------------+------+-----+---------+----------------+
I think this is the code you are looking for.
http://www.karlrixon.co.uk/articles/sql/update-multiple-rows-with-different-values-and-a-single-sql-query/
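As the title of the linked article suggests, several rows can be given different values in one statement; a common way to do that is a CASE expression keyed on the ID. The sketch below is my own illustration of the pattern, not code from the article. It uses Python with an in-memory SQLite table standing in for the MySQL one (the same UPDATE syntax works in MySQL), and the helper set_run_order is a hypothetical name. It numbers the sites downward from the list length, as in the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sites (
        site_id   INTEGER PRIMARY KEY,
        selected  TEXT NOT NULL DEFAULT 'false',
        run_order INTEGER NOT NULL DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO sites (site_id) VALUES (?)",
    [(i,) for i in (2, 9, 14, 17, 23, 44, 50)],
)

def set_run_order(visit_ids):
    """Set run_order for all listed sites with a single UPDATE.

    The first id gets run_order len(visit_ids), the last gets 1.
    """
    if not visit_ids:
        return  # avoid generating an empty IN () list
    n = len(visit_ids)
    whens = " ".join("WHEN ? THEN ?" for _ in visit_ids)
    marks = ", ".join("?" for _ in visit_ids)
    sql = (
        f"UPDATE sites SET run_order = CASE site_id {whens} END "
        f"WHERE site_id IN ({marks})"
    )
    params = []
    for position, site_id in enumerate(visit_ids):
        params += [site_id, n - position]   # one (id, order) pair per WHEN
    params += visit_ids                     # ids again for the IN list
    conn.execute(sql, params)

set_run_order([23, 17, 9, 44, 2, 14])
result = dict(conn.execute("SELECT site_id, run_order FROM sites"))
print(result)
```

The WHERE ... IN clause matters: without it, rows not named in the CASE would have run_order set to NULL.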