SQL big tables optimization - MySQL

I'm currently developing a fairly big application that will manipulate a lot of data.
I'm designing the data model and I wonder how to tune it for large amounts of data. (My DBMS is MySQL.)
I have a table that will contain objects called "values". There are 6 columns:
id
type_bool
type_float
type_date
type_text
type_int
Depending on the type of the value (which is stored elsewhere), one of these columns holds the data and the others are NULL.
This table is meant to contain millions of rows (and it will grow very fast). It is also going to be read very often.
My design produces a lot of rows with little data in each. I wonder if it would be better to create 5 different tables, each containing only one type of data. With that solution there would be many more joins.
Can you give me a piece of advice?
Thank you very much!
EDIT: Description of my tables
TABLE ELEMENT: In the application there are elements that contain attributes.
There will be a LOT of rows.
There are a lot of reads/writes, few updates/deletes.
TABLE ATTRIBUTEDEFINITION: Each attribute is described (at design time) in the table attributeDefinition, which says what the type of the attribute is.
There will not be a lot of rows.
There are few writes at the beginning, but a LOT of reads.
TABLE ATTRIBUTEVALUE: Another table, "attributeValue", contains the actual data of each attributeDefinition for each element.
There will be a LOT of rows ([nb of Element] x [nb of attribute])
There are a LOT of reads/writes/UPDATEs.
TABLE LISTVALUE: Some types are complex, like the list type. The set of values available for such a type lives in another table called LISTVALUE. The attributeValue table then contains an id that is a key of the listValue table.
Here are the create statements:
CREATE TABLE `element` (
  `id` int(11),
  `group` int(11), ...

CREATE TABLE `attributeDefinition` (
  `id` int(11),
  `name` varchar(100),
  `typeChamps` varchar(45), ...

CREATE TABLE `attributeValue` (
  `id` int(11),
  `elementId` int(11),              ===> table element
  `attributeDefinitionId` int(11),  ===> table attributeDefinition
  `type_bool` tinyint(1),
  `type_float` decimal(9,8),
  `type_int` int(11),
  `type_text` varchar(1000),
  `type_date` date,
  `type_list` int(11),              ===> table listValue
  ...

CREATE TABLE `listValue` (
  `id` int(11),
  `name` varchar(100), ...
And here is an example SELECT that retrieves all elements of the group whose id is 66:
SELECT attributeValue.elementId,
       attributeValue.id AS idAttribute,
       attributeDefinition.name AS attributeName,
       attributeDefinition.typeChamps AS attributeType,
       listValue.name AS valeurDeListe,
       attributeValue.type_bool,
       attributeValue.type_int,
       DATE_FORMAT(attributeValue.type_date, '%d/%m/%Y') AS type_date,
       attributeValue.type_float,
       attributeValue.type_text
FROM element
JOIN attributeValue ON attributeValue.elementId = element.id
JOIN attributeDefinition ON attributeValue.attributeDefinitionId = attributeDefinition.id
LEFT JOIN listValue ON attributeValue.type_list = listValue.id
WHERE element.`group` = 66
In my application, for each row, I print the value that corresponds to the type of the attribute.

Since you are only ever filling in a single column per row, create a different table for each data type: if you are inserting large quantities of data, you will waste a lot of space with the current design.
Having fewer rows in each table will increase index lookup speed.
Your column names should describe the data in them, not the column type.
Read up on Database Normalisation.
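For illustration, the split-by-type design might look something like this (a sketch only; the table and column names here are made up):

CREATE TABLE attribute_int (
  element_id int NOT NULL,               -- which element this value belongs to
  attribute_definition_id int NOT NULL,  -- which attribute of that element
  value int NOT NULL,                    -- named for the data it holds, not its type
  PRIMARY KEY (element_id, attribute_definition_id)
);
-- ...and likewise attribute_bool, attribute_float, attribute_date, attribute_text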

Writing will not be the issue here; reading will.
You have to ask yourself:
how often are you going to query this?
is old data modified, or is it append-only?
==> If the answers are "frequently" / "append only" (or only minor modifications of old data), a cache may solve your read issues, as you won't hit the database so often.

There will be a lot of NULL fields in each row. If the table were small that would be OK, but as you said there will be millions of rows, so you are wasting space and the queries will take longer to execute. Do something like this:
table1
id | type
table2
type | other fields

Advice I have, although it might not be the kind you want :-)
This looks like an entity-attribute-value schema; using this kind of schema leads to all kinds of maintenance/performance nightmares:
complicated queries to get all values for a master record (essentially, you'll have to left-join your value table with itself N times to obtain N attributes for a master record; see the sketch below)
no referential integrity (I'm assuming you'll have lookup values with separate master data tables; you cannot use foreign key constraints for this)
waste of disk space (since your table will be sparsely filled)
For a more complete list of reasons to avoid this kind of schema, I'd recommend getting a copy of SQL Antipatterns.
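To make the join-explosion point concrete: reassembling N attributes for a master record takes one join per attribute. Against the tables above it would look roughly like this (a sketch; the attributeDefinition ids 1, 2, 3 are hypothetical):

SELECT e.id,
       v1.type_text AS name,        -- attribute 1, hypothetical
       v2.type_int  AS quantity,    -- attribute 2, hypothetical
       v3.type_date AS created_at   -- attribute 3, hypothetical
FROM element e
LEFT JOIN attributeValue v1 ON v1.elementId = e.id AND v1.attributeDefinitionId = 1
LEFT JOIN attributeValue v2 ON v2.elementId = e.id AND v2.attributeDefinitionId = 2
LEFT JOIN attributeValue v3 ON v3.elementId = e.id AND v3.attributeDefinitionId = 3;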

Finally, I implemented both solutions and benchmarked them.
For both solutions there was an element table and an attributeDefinition table, as follows:
[attributeDefinition]
| id | group | name                         | type       |
| 12 | 51    | 'The Bool attribute'         | type_bool  |
| 13 | 51    | 'The Int attribute'          | type_int   |
| 14 | 51    | 'The first Float attribute'  | type_float |
| 15 | 51    | 'The second Float attribute' | type_float |
[element]
| id | group | name                         |
| 42 | 51    | 'An element in the group 51' |
First solution (the better one)
One big table with one column per type and many empty cells, holding each value of each attribute of each element:
[attributeValue]
| id | element | attributeDefinition | type_int | type_bool | type_float | ...
| 1  | 42      | 12                  | NULL     | TRUE      | NULL       | NULL...
| 2  | 42      | 13                  | 5421     | NULL      | NULL       | NULL...
| 3  | 42      | 14                  | NULL     | NULL      | 23.5       | NULL...
| 4  | 42      | 15                  | NULL     | NULL      | 56.8       | NULL...
Plus the attributeDefinition table, which describes each attribute of every element in a group.
Second solution (the worse one)
8 tables, one per type:
[type_float]
| id | group | element | value |
| 3  | 51    | 42      | 23.5  |
| 4  | 51    | 42      | 56.8  |
[type_bool]
| id | group | element | value |
| 1  | 51    | 42      | TRUE  |
[type_int]
| id | group | element | value |
| 2  | 51    | 42      | 5421  |
Conclusion
My benchmark first looked at database size. I had 1,500,000 rows in the big table, which means approximately 150,000 rows in each small table (with 10 data types).
Looking in phpMyAdmin, the sizes are almost exactly the same.
First conclusion: empty (NULL) cells take up almost no space.
My second benchmark was a performance test: getting all values of all attributes of all elements in one group. There are 15 groups in the database. Each group has:
400 elements
30 attributes per element
So that is 12,000 rows in [attributeValue], or 1,200 rows in each [type_*] table.
The first SELECT does a single join between [attributeValue] and [element] to apply a WHERE on the group.
The second SELECT uses a UNION of 10 SELECTs, one per [type_*] table.
That second SELECT is 10 times slower!
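For reference, the two queries compared were roughly of this shape (a sketch; only a few of the 10 types are shown, and the CASTs are my addition so the UNION branches share a value type):

-- Solution 1: a single join, filtering on the group
SELECT av.*
FROM attributeValue av
JOIN element e ON av.element = e.id
WHERE e.`group` = 51;

-- Solution 2: one SELECT per type table, glued together with UNION
SELECT element, 'type_int' AS type, CAST(value AS CHAR) AS value
FROM type_int WHERE `group` = 51
UNION ALL
SELECT element, 'type_bool', CAST(value AS CHAR)
FROM type_bool WHERE `group` = 51
UNION ALL
SELECT element, 'type_float', CAST(value AS CHAR)
FROM type_float WHERE `group` = 51;
-- ...and so on for the remaining type_* tables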
Second conclusion: one table is better than many.

Related

How to refer to a record in a table without a primary key

I have created a table without any primary key and it contains some exactly identical records.
How do I update or view a record using SQL statements?
The structure of my table is like:
+----------------+-------+---------+----------+
| Name           | class | section | City     |
+----------------+-------+---------+----------+
| Mohit Yadav    | 10    | A       | Neemrana |
| Mohit Yadav    | 10    | A       | Neemrana |
| Janvi Yadav    | 10    | A       | Neemrana |
| Jaspreet Singh | 11    | B       | Jaipur   |
| Jaspreet Singh | 11    | B       | NULL     |
+----------------+-------+---------+----------+
Can we refer to the second record and change its class to 11 using an UPDATE command?
Something like this would work:
UPDATE <SOMETBL> SET CLASS='11' WHERE {INDEX_OF_RECORD=1};
Please rectify the part written inside the curly brackets so that I can refer to a record using its index.
First of all, not having a primary key is not a good idea at all; it is always good practice to have the so-called ID column. But given the situation as it stands, there are a few options.
The first and second records are exactly identical, as you said, so there is no actual difference by which to distinguish them. It therefore doesn't matter whether you change the first row or the second, and a good way to achieve this is to limit the number of rows the UPDATE affects. You can simply use this:
UPDATE <SOMETBL> SET CLASS='11' WHERE
NAME ='Mohit Yadav' AND
CLASS ='10' AND
SECTION ='A' AND
CITY ='Neemrana'
LIMIT 1;
The easiest way to solve this is to add an auto-incrementing column and then refer to the record by its now-unique int:
ALTER TABLE `t` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY;
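Once the column exists, the problematic row can be addressed directly, for example (assuming the duplicate ended up with id 2, since ids are assigned in insertion order):

UPDATE `t` SET `class` = '11' WHERE `id` = 2;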

Avoid Duplicate Records with BeforeChange Table Event

I have a situation in an MS Access database where I must prevent duplicate records based on a combination of three attributes:
StudentNumber
ColleagueID
TypeOfAttending
So, for one combination (StudentNumber & ColleagueID) there are three types of attending: A, B and C.
Here is an example:
+---------------+-------------+---------------+
| StudentNumber | ColleagueID | AttendingType |
+---------------+-------------+---------------+
| 100           | 10          | A             |
| 100           | 10          | B             |
| 100           | 10          | C             |
| 100           | 11          | A             |
| 100           | 11          | B             |
| 100           | 11          | C             |
| 100           | 11          | C             |
+---------------+-------------+---------------+
So the last row would not be acceptable.
Does anyone have any idea?
As noted, you could make all 3 columns the PK. Or you can create a unique index on all 3 columns. These two ideas are thus code free.
Last but not least, you could use a Before Change macro and do a search (lookup) in the table to check whether the record already exists. Given your information, a unique index is likely the least effort, and it does not require changing the PK to all 3 columns (which, as noted, is another solution).
So, you could consider a Before Change macro, and use this:
Look Up A Record In MyTable
  Where Condition = [Z].[StudentNumber]=[MyTable].[StudentNumber] And
                    [Z].[ColleagueID]=[MyTable].[ColleagueID] And
                    [Z].[AttendingType]=[MyTable].[AttendingType] And
                    [Z].[ID]<>[MyTable].[ID]
  Alias Z
  RaiseError
    Error Number: -123
    Error Description: There are other rows with this data
So, you can use a data macro: the Before Change table macro. Make sure the RaiseError action is indented "inside" of the Look Up A Record block. Note how we use an alias for the lookup: since the table name (MyTable) is already in context as the current row of data, we look up using "Z" as an alias to distinguish the current row from the looked-up record.
From a learning point of view the above table macro works, but it is likely less effort to simply set up a unique index on all 3 columns.
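If you go the index route, the index can be created once with plain DDL in the query designer's SQL view, along these lines (the table name Attending is assumed here):

CREATE UNIQUE INDEX NoDuplicateAttending
ON Attending (StudentNumber, ColleagueID, AttendingType);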

ungroup table, manipulate columns and convert to a row

I have an Excel table (can be converted to XML or CSV for manipulating) of this structure:
| License-plate | Parking | Fuel | Cleaning |
---------------------------------------------
| 1111AAA       | 234     | 21   | 1244     |
| 2222AAA       | 22      | 12   | 644      |
| 3333BBB       | 523     | 123  | 123      |
It shows the monthly spending on parking, fuel, etc. per car.
License plate is a unique value in the table.
I need to convert this table into the following to import it into MySQL, but I don't know how to do that, nor which tool is good for it:
| License-plate | Concept  | Amount |
-------------------------------------
| 1111AAA       | Parking  | 234    |
| 1111AAA       | Fuel     | 21     |
| 1111AAA       | Cleaning | 1244   |
| 2222AAA       | Parking  | 22     |
| 2222AAA       | Fuel     | 12     |
| 2222AAA       | Cleaning | 644    |
| .......       | ........ | .....  |
In the result table License-plate is no longer unique; it is repeated once for each concept it has.
UPD: I just discovered that this can be called denormalized data (maybe not exactly).
I would do it with MySQL, in the following way:
Import the table (after converting it to CSV) into MySQL. Let's call it source:
CREATE TABLE source (
License_Plate char(7) primary key,
Parking int(8) unsigned,
Fuel int(8) unsigned,
Cleaning int(8) unsigned
);
LOAD DATA INFILE 'path/to/file' INTO TABLE source FIELDS TERMINATED BY ',';
Create another table with the desired final structure; let's call it destination:
CREATE TABLE destination (
License_Plate char(7),
Concept varchar(10),
Amount int(8) unsigned
);
Perform the following queries:
INSERT INTO destination
SELECT License_Plate, 'Parking' AS Concept, Parking AS Amount
FROM source;

INSERT INTO destination
SELECT License_Plate, 'Fuel' AS Concept, Fuel AS Amount
FROM source;

INSERT INTO destination
SELECT License_Plate, 'Cleaning' AS Concept, Cleaning AS Amount
FROM source;
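The three statements can also be collapsed into a single INSERT by UNION-ing the SELECTs, which keeps the load to one statement:

INSERT INTO destination
SELECT License_Plate, 'Parking', Parking FROM source
UNION ALL
SELECT License_Plate, 'Fuel', Fuel FROM source
UNION ALL
SELECT License_Plate, 'Cleaning', Cleaning FROM source;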
Things to consider:
I declared License_Plate as a primary key, based on your example. This might not hold if it repeats in the real data. Also, if you have more than one row in the source table for the same license plate, you will probably need to adjust my 3 queries to aggregate values.
Also, the datatypes are adjusted to the sample data; you might need to change them if you have values with more than 8 digits, for instance.
LOAD DATA is one way to upload your CSV. It has many options; you should check them out. You can also do it with various tools, so as not to write that statement by hand.
Last, those table names were chosen as an example, you should come up with better ones, that represent your problem domain.
Hope this helps you.
The comment of @pnuts helped me. It is a very easy solution and can be done in Excel. Thank you!
The solution is: Convert matrix to 3-column table ('reverse pivot', 'unpivot', 'flatten', 'normalize')

SQL Design Decision: Should I merge these tables?

I'm attempting to design a small database for a customer. My customer has an organization that works with public and private schools; for every school that's involved, there's an implementation (a chapter) at that school.
To design this, I've put together two tables: one for schools and one for chapters. I'm not sure, however, whether I should merge the two. The tables are as follows:
mysql> describe chapters;
+--------------------+------------------+------+-----+---------+----------------+
| Field              | Type             | Null | Key | Default | Extra          |
+--------------------+------------------+------+-----+---------+----------------+
| id                 | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| school_id          | int(10) unsigned | NO   | MUL |         |                |
| is_active          | tinyint(1)       | NO   |     | 1       |                |
| registration_date  | date             | YES  |     | NULL    |                |
| state_registration | varchar(10)      | YES  |     | NULL    |                |
| renewal_date       | date             | YES  |     | NULL    |                |
| population         | int(10) unsigned | YES  |     | NULL    |                |
+--------------------+------------------+------+-----+---------+----------------+
7 rows in set (0.01 sec)

mysql> describe schools;
+----------------------+------------------------------------+------+-----+---------+----------------+
| Field                | Type                               | Null | Key | Default | Extra          |
+----------------------+------------------------------------+------+-----+---------+----------------+
| id                   | int(10) unsigned                   | NO   | PRI | NULL    | auto_increment |
| full_name            | varchar(255)                       | NO   | MUL |         |                |
| classification       | enum('high','middle','elementary') | NO   |     |         |                |
| address              | varchar(255)                       | NO   |     |         |                |
| city                 | varchar(40)                        | NO   |     |         |                |
| state                | char(2)                            | NO   |     |         |                |
| zip                  | int(5) unsigned                    | NO   |     |         |                |
| principal_first_name | varchar(20)                        | YES  |     | NULL    |                |
| principal_last_name  | varchar(20)                        | YES  |     | NULL    |                |
| principal_email      | varchar(20)                        | YES  |     | NULL    |                |
| website              | varchar(20)                        | YES  |     | NULL    |                |
| population           | int(10) unsigned                   | YES  |     | NULL    |                |
+----------------------+------------------------------------+------+-----+---------+----------------+
12 rows in set (0.01 sec)
(Note that these tables are incomplete - I haven't implemented foreign keys yet. Also, please ignore the varchar sizes for some of the fields; they'll be changing.)
So far, the pros of keeping them separate are:
- Separate queries of schools and chapters are easier. I don't know if it's necessary at the moment, but it's nice to be able to do.
- I can make a chapter inactive without directly affecting the school information.
- General separation of data: the fields in "chapters" are directly related to the chapter itself, not the school in which it exists. (I like the organization; it makes more sense to me. It also follows the "nothing but the key" mantra.)
- If possible, we can collect school data without having a chapter associated with it, which may make sense if we eventually want people to select a school and autopopulate the data.
And the cons:
- Separate IDs for schools and chapters. As far as I know, there will only ever be a one-to-one relationship between the two, so doing this might introduce more complexity that could lead to errors down the line (like importing data from a spreadsheet, which is unfortunately something I'll be doing a lot of).
- If there's a one-to-one ratio, and the IDs are auto_increment fields, I'm guessing the chapter_id and school_id will end up being the same, so why not just put them in a single table?
- From what I understand, the chapters aren't really identifiable on their own; they're bound to a school, and as such should be a subset of a school. Should they really be separate objects in a table?
Right now, I'm leaning towards keeping them as two separate tables; it seems as though the pros outweigh the cons, but I want to make sure that I'm not creating a situation that could cause problems down the line. I've been in touch with my customer and I'm trying to get more details about the data they store and what they want to do with it, which I think will really help. However, I'd like some opinions from the well-informed folks on here; is there anything I haven't thought of? The bottom line here is just that I want to do things right the first time around.
I think they should be kept separate. But you can make the chapter a subtype of the school (and the school the supertype) and use the same ID. Elsewhere in the database, where you use SchoolID you mean the school, and where you use ChapterID you mean the chapter.
CREATE TABLE School (
  SchoolID int unsigned NOT NULL AUTO_INCREMENT,
  CONSTRAINT PK_School PRIMARY KEY (SchoolID)
);

CREATE TABLE Chapter (
  ChapterID int unsigned NOT NULL,
  CONSTRAINT PK_Chapter PRIMARY KEY (ChapterID),
  CONSTRAINT FK_Chapter_School FOREIGN KEY (ChapterID) REFERENCES School (SchoolID)
);
Now you can't have a chapter unless there's a school first. If at some point you had to allow multiple chapters per school, you would recreate the Chapter table with ChapterID as an identity/auto-increment column, add a SchoolID column populated with the same values, put the FK to School on that column, and continue as before, only inserting the ID into SchoolID instead of ChapterID. MySQL does allow inserting explicit values into an AUTO_INCREMENT column, so making SchoolID auto-increment ahead of time could save you trouble later.
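Inserting then works by creating the school first and reusing its generated id for the chapter, along these lines (a sketch; a real School table would have more columns than the one above):

INSERT INTO School () VALUES ();   -- MySQL generates the SchoolID
INSERT INTO Chapter (ChapterID)
VALUES (LAST_INSERT_ID());         -- the chapter reuses the school's id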
Additional benefits of keeping them separate:
You can make foreign key relationships directly with SchoolID or ChapterID so that the data you're storing is always correct (for example, if no chapter exists yet you can't store related data for such a thing until it is created).
Querying each table separately will perform better as the rows don't contain extraneous information.
A school can be created with certain required columns, but the chapter left uncreated (temporarily). Then, when it is created, you can have some NOT NULL columns in it as well.
Keep them separate.
They may be 1-1 currently; however, these are clearly separate concepts.
Will they eventually want to input schools which do not have chapters, perhaps as part of a sales lead system?
Can there really be only one chapter per school, or just one active chapter? What about across time? Is it possible they will request a report of all chapters in the past 10 years at school X?
You said the links will always be 1 to 1, but does a school always have a chapter? Can it change chapters? If so, keeping chapters separate is a good idea.
Another reason to keep them separate is if the amount of information about the two entities combined would make the records longer than the database backend can handle. One-to-one tables are often built to keep the amount of data stored in a record down to an appropriate size.
Further, is the requirement firmly 1-1, or does it have the potential to become 1-many? If the latter, make it a separate table now. Is there the potential to have schools without chapters? Again, I'd keep them separate.
And how do you intend to query this data? Will you generally need the data about both the chapter and the school in the same queries? Then you might put them in one table, if you are sure there is no possibility of it turning into a 1-many relationship. However, a proper join with the join fields indexed should be fast anyway.
I tend to see these as separate entities and would keep them separate unless there was a critical performance problem that forced putting them together. Having separate entities in separate tables from the start tends to be less risky than combining them. And performance would normally be perfectly acceptable as long as the indexing is correct; it may even be better if you don't normally need to query data from both tables all the time.

Database "pointers" to rows?

Is there a way to have "pointers to rows" in a database?
For example, I have X product rows. All these rows represent distinct products, but many have the same field values, except that their "id" and "color_id" differ.
I thought of just duplicating the rows, but this could be error prone; plus, a small change would have to be made in several rows, which again is buggy.
Question: is there a way to fill some rows fully, then use a special value to "point to" certain field values?
For example:
id | field1   | field2   | field3   | color_id
-----------------------------------------------
1  | value1   | value2   | value3   | blue
2  | point[1] | point[1] | point[1] | red    (same as row 1, except id and color)
3  | point[1] | point[1] | point[1] | green  (same as row 1, except id and color)
4  | valueA   | valueB   | valueC   | orange
5  | point[4] | point[4] | point[4] | brown  (same as row 4, except id and color)
6  | valueX   | valueY   | valueZ   | pink
7  | point[6] | point[6] | point[6] | yellow (same as row 6, except id and color)
I'm using MySQL, but this is more of a general question. Also, if this goes completely against database theory, an explanation of why it is bad would be appreciated.
This does go against database design. Look up descriptions of normalization and relational algebra. It is bad mainly because of the comment you made yourself: "duplicating the rows ... could be error prone, plus making a small change would have to be done on several rows, again buggy."
The idea of relational databases is to act on sets of data and find things by matching on primary and foreign keys and absolutely not to use or think of pointers at all.
If you have common data for each product, then create a product table:
create table product (
  product_id int,
  field1 ...,
  field2 ...,
  field3 ...
)
with a primary key on product_id.
The main table would then have the fields id, color_id and product_id.
If the product table looks like
product_id | field1 | field2 | field3
-----------------------------------------------
1          | value1 | value2 | value3
2          | valueA | valueB | valueC
3          | valueX | valueY | valueZ
The main table would look like
id | product_id | color_id
--------------------------------
1  | 1          | blue
2  | 1          | red
3  | 1          | green
4  | 2          | orange
5  | 2          | brown
6  | 3          | pink
7  | 3          | yellow
Sure, there is a way to have pointers to rows in a database: just don't use a relational DBMS. In the 1960s and 1970s there were several very successful DBMS products based entirely on linking records together by embedding pointers to records inside other records. Perhaps the best known of these was IMS.
The downside of having pointers to records in other records was that the resulting database was far less flexible than relational databases ended up being. For predetermined access paths, a database built on a network of pointers is actually faster than a relational database. But when you want to combine the data in multiple ways, the lack of flexibility will kill you.
That is why relational DBMSes took over the field in the 1980s and 1990s, although hierarchical and network databases still survive for fairly specialized work.
As others have suggested, you should learn normalization. When you do, you will learn how to decompose tables into smaller tables with fewer columns (fields) in each. When you need the data in joined fashion, you can use a relational join to put it back together. Relational joins can be almost as fast as navigating by pointers, especially if you have the right indexes built.
Normalization will help you avoid harmful redundancy, which is the problem you highlighted in your question.
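Concretely, with the product/main split sketched in the earlier answer, putting the data back together is a single join (the table name main is assumed here for what that answer calls "the main table"):

SELECT m.id, p.field1, p.field2, p.field3, m.color_id
FROM main m
JOIN product p ON p.product_id = m.product_id;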
One way of doing this is to move the columns that hold the repeated data into a separate table. Give each row in this new table a unique id. Add a column to the original table that contains the id from the new table, and then use a FOREIGN KEY relationship between the original table and the new table's id column.
This would be called normalization under normal circumstances; the whole point of it is to deal with exactly this kind of scenario. So no, it can't be done the way you want; you will need to normalize the data properly.
Create separate tables for the field1, field2 and field3 values.
Put the existing values there, and reference them by putting their ids into your current table.
If you're using common string values, it's good to store the strings in a separate table and refer to them with foreign keys. If you're storing anything like an integer, it wouldn't be worth it - the size of the pointer would be comparable to the size of the data itself.
It does go against database theory because you're throwing the relational part of databases out the window.
The way to do it is to make an ObjectID column that contains the key of the row you want to point to.
id | field1 | field2 | field3 | color_id | object_id |
------------------------------------------------------------
1  | value1 | value2 | value3 | blue     | NULL      |
2  | NULL   | NULL   | NULL   | red      | 1         |
3  | NULL   | NULL   | NULL   | green    | 1         |
4  | valueA | valueB | valueC | orange   | NULL      |
5  | NULL   | NULL   | NULL   | brown    | 4         |
6  | valueX | valueY | valueZ | pink     | NULL      |
7  | NULL   | NULL   | NULL   | yellow   | 6         |
But remember: this is a bad idea. Don't do it. If you did want to do it, that is how.
There are instances where it's required, but after dealing with a system in which this was pervasive, I'd always try to find another way, even if it means duplicating data and letting your business layer keep everything straight.
I work in a system where this was done throughout the system, and it's maddening when you have to recreate the functionality of relationships because someone wanted to be clever.
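To see why, consider what merely reading the data back requires once rows point at other rows: every query needs a self-join plus COALESCE to resolve the pointers. A sketch against the layout above (the table name products is assumed):

SELECT t.id,
       COALESCE(b.field1, t.field1) AS field1,
       COALESCE(b.field2, t.field2) AS field2,
       COALESCE(b.field3, t.field3) AS field3,
       t.color_id
FROM products t
LEFT JOIN products b ON b.id = t.object_id;  -- follow the "pointer", if any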
The way you would want to implement this in a database would be to create two tables:
object_id | field1 | field2 | field3
and
instance_id | object_id | colour
And then the rows of the second would point to the first, and you could generate the full table of data on the fly with:
select t1.*, t2.colour from t1 join t2 on (t1.object_id = t2.object_id)
You should probably have two tables with a foreign key relationship.
Example
Products:
Id
field1
field2
field3
ProductColors:
Id
ProductId
Color
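A minimal sketch of that layout in MySQL (the column types are assumed):

CREATE TABLE Products (
  Id int NOT NULL AUTO_INCREMENT PRIMARY KEY,
  field1 varchar(100),
  field2 varchar(100),
  field3 varchar(100)
);

CREATE TABLE ProductColors (
  Id int NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ProductId int NOT NULL,
  Color varchar(40),
  FOREIGN KEY (ProductId) REFERENCES Products (Id)
);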