Normalizing MySQL data - mysql

I'm new to MySQL, and just learned about the importance of data normalization. My database has a simple structure:
I have 1 table called users with fields:
userName (string)
userEmail (string)
password (string)
requests (an array of dictionaries in JSON string format)
data (another array of dictionaries in JSON string format)
deviceID (string)
Right now, this is my structure. Being very new to MySQL, I'm really not seeing why the above structure is a bad idea. Why would I need to normalize this and make separate tables? That's the first question: why? (Some have also said not to put JSON in my table. Why or why not?)
The second question is how? With the above structure, how many tables should I have, and what would be in each table?
Edit:
So maybe normalization is not absolutely necessary here, but maybe there's a better way to implement my data field? The data field is an array of dictionaries: each dictionary is just a note item with a few keys (title, author, date, body). What I do now, which I suspect is inefficient, is this: every time a user composes a new note, I send that note from my app to PHP. I take the JSON array of dictionaries already stored in that user's data field, convert it to a PHP array, append the new note to the end, convert the whole thing back to JSON, and write it back to the table as an array of dictionaries. This process is repeated every time a new note is composed. Is there a better way to do this? Maybe a user's data should be a table, with each row being a note, but I'm not really sure how that would work?

The answer to all your questions really depends on what the JSON data is for, and whether you'll ever need to use some property of that data to determine which rows are returned.
If your data truly has no schema, and you're really just using it to store data that will be used by an application that knows how to retrieve the correct row by some other criteria (such as one of the other fields) every time, there's no reason to store it as anything other than exactly as that application expects it (in this case, JSON).
If the JSON data DOES contain some structure that is the same for all entries, and if it's useful to query this data directly from the database, you would want to create one or more tables (or maybe just some more fields) to hold this data.
As a practical example of this, if the data field contains JSON enumerating services for that user in an array, and each service has a unique id, type, and price, you might want a separate table with the following fields (using your own naming conventions):
serviceId (integer)
userName (string)
serviceType (string)
servicePrice (float)
Each service for that user would then get its own entry. You could then query for users that have a particular service, which, depending on your needs, could be very useful. In addition to easy querying, indexing certain fields of the separate tables can also make for very quick queries.
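A rough sketch of that services table and the kind of query it enables (table and column names here are only illustrative, not the actual schema from the question):
CREATE TABLE services (
  serviceId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  userName VARCHAR(255),
  serviceType VARCHAR(50),
  servicePrice DECIMAL(10,2),   # DECIMAL avoids float rounding issues for prices
  INDEX idx_service_type (serviceType)
);
# Find every user that has a 'premium' service (the value is made up)
SELECT userName FROM services WHERE serviceType = 'premium';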
Update: Based on your explanation of the data stored, and the way you use it, you probably do want it normalized. Something like the following:
# user table
userId (integer, auto-incrementing)
userName (string)
userEmail (string)
password (string)
deviceID (string)
# note table
noteId (integer, auto-incrementing)
userId (integer, matches user.userId)
noteTime (datetime)
noteData (string, possibly split into separate fields depending on content, such as subject, etc.)
# request table
requestId (integer, auto-incrementing)
userId (integer, matches user.userId)
requestTime (datetime)
requestData (string, again split as needed)
You could then query like so:
# Get a user
SELECT * FROM user WHERE userId = '123';
SELECT * FROM user WHERE userName = 'foo';
# Get all requests for a user
SELECT * FROM request WHERE userId = '123';
# Get a single request
SELECT * FROM request WHERE requestId = '325325';
# Get all notes for a user
SELECT * FROM note WHERE userId = '123';
# Get all notes from last week
SELECT * FROM note WHERE userId = '123' AND noteTime > CURDATE() - INTERVAL 1 WEEK;
# Add a note to user 123
INSERT INTO note (noteId, userId, noteData) VALUES (null, 123, 'This is a note');
Notice how much more you can do with normalized data, and how easy it is? It's trivial to locate, update, append, or delete any specific component.
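For completeness, updating or deleting one specific item is just as direct (a sketch; the IDs used here are hypothetical):
# Change the text of a single note
UPDATE note SET noteData = 'Revised note text' WHERE noteId = 42;
# Remove a single request
DELETE FROM request WHERE requestId = 325325;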

Normalization is a philosophy. Some people think it fits their database approach, some don't. Many modern database solutions even focus on denormalization to improve speeds.
Normalization often doesn't improve speed. However, it greatly improves the simplicity of accessing and writing data. For example, if you wanted to add a request, you would have to write a completely new JSON field. If it was normalized, you could simply add a row to a table.
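To make that contrast concrete, a minimal sketch (the table and column names are hypothetical, following the schemas discussed above):
# De-normalized: read the JSON blob, modify it in code, write the whole thing back
UPDATE users SET requests = '[ ...entire rewritten JSON array... ]' WHERE userName = 'foo';
# Normalized: just append one row
INSERT INTO request (userId, requestTime, requestData) VALUES (123, NOW(), 'new request');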
In normalization, "array of dictionaries in JSON string format" is always bad. Array of dictionaries can be translated as list of rows, which is a table.
If you're new to databases: NORMALIZE. Denormalization is something for professionals.

A main benefit of normalization is to eliminate redundant data, but since each user's data is unique to that user, there is no benefit to splitting this table and normalizing. Furthermore, since the front-end will employ the dictionaries as JSON objects anyway, undue complication and a decrease in performance would result from trying to decompose this data.
Okay, here is a normalized MySQL data model. Note: you can separate authors and titles into two tables to further reduce data redundancy. You can probably use similar techniques for the "requests dictionaries":
CREATE TABLE USERS(
UID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
userName varchar(255) UNIQUE,
password varchar(30),
userEmail varchar(255) UNIQUE,
deviceID varchar(255)
) ENGINE=InnoDB;
CREATE TABLE BOOKS(
BKID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
FKUSERS int,
Title varchar(255),
Author varchar(50)
) ENGINE=InnoDB;
ALTER TABLE BOOKS
ADD FOREIGN KEY (FKUSERS)
REFERENCES USERS(UID);
CREATE TABLE NOTES(
ID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
FKUSERS int,
FKBOOKS int,
Date date,
Notes text
) ENGINE=InnoDB;
ALTER TABLE NOTES
ADD FOREIGN KEY BKNO (FKUSERS)
REFERENCES USERS(UID);
ALTER TABLE NOTES
ADD FOREIGN KEY (FKBOOKS)
REFERENCES BOOKS(BKID);
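With this model, each note dictionary from the question becomes a row, and retrieval is a plain join. A sketch using the tables above (the literal values are made up):
# Add a note for user 1 about book 7
INSERT INTO NOTES (FKUSERS, FKBOOKS, Date, Notes)
VALUES (1, 7, CURDATE(), 'Body of the note');
# All notes for user 1, with the book title and author
SELECT b.Title, b.Author, n.Date, n.Notes
FROM NOTES n
JOIN BOOKS b ON b.BKID = n.FKBOOKS
WHERE n.FKUSERS = 1;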

In your case, I would abstract out the class that handles this table and keep the data denormalized for now. If, in the future, the data access patterns change and I need to normalize the data, I can just do so with less impact on the program: I only need to change the class that handles this set of data to query the normalized tables, but return the data as if the database structure never changed.

Related

How to create dynamic data fields in SQL database

There is a requirement to handle dynamic data fields at the database level. Say we have a table called Employee that has name, surname, and contact number fields (3 basic fields). As the application progresses, the requirement is that the database and the application should be able to add (handle) dynamic data fields, together with their types, in the database.
Ex: A user will add date of birth and address fields dynamically to the Employee table, which has the 3 basic fields.
The problem is how to cater to this requirement in the optimum way.
There is a picture where I have designed tables to cater to this, but I am open to an industry-standard, optimum way of achieving this without future problems.
Please advise.
You basically have four options for handling such dynamic fields:
Modify the base table structure whenever a new column is added.
Using JSON to store the values.
Using an EAV (entity-attribute-value) model.
Basically (1), but storing the additional values in a separate table, or a separate table per user.
You have not provided enough information in the question to determine which of these is most appropriate for your data model.
However, here is a quick run-down of strengths and weaknesses:
For modifying the table: On the downside, modifying a table is an expensive operation (especially as the table gets bigger). On the upside, the columns will be visible to all users and have the appropriate type.
For JSON: JSON is quite flexible. However, JSON incurs very large storage overheads because the name of each field is repeated every time it is used. In addition, you don't have a list of all the added fields (unless you maintain that in a separate table).
For EAV: EAV is flexible, but not quite as flexible as JSON. The problem is that the value column has a single type (usually a string), or else accessing the data gets more complicated. Like JSON, this repeats the "name" of the value every time it is used. However, this is often a key into another table, so the overhead is less.
For a separate table for each user: This primary advantage here is isolating users from each other. If this is a requirement, then this might be the way to go (although adding a userId to the EAV model would also work).
So, the most appropriate method depends on factors, such as:
Will the fields be shared among all users?
Do the additional fields all have the same type?
What are your concerns about performance and data size?
How often will new fields be added?
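As a point of reference for option (2): recent MySQL versions (5.7+) have a native JSON type, so the blob is not entirely opaque; you can index into it via a generated column. A minimal sketch (the table and field names here are hypothetical):
CREATE TABLE employee (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100),
  surname VARCHAR(100),
  contact_no VARCHAR(30),
  extra JSON,
  # a generated column extracted from the JSON, so it can be indexed
  date_of_birth DATE AS (CAST(extra->>'$.date_of_birth' AS DATE)) STORED,
  INDEX idx_dob (date_of_birth)
);
# Employees born before 1990, using the index on the generated column
SELECT name, surname FROM employee WHERE date_of_birth < '1990-01-01';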
To have dynamic fields you can use another table where you can set properties of the user
user table has columns
userid, name, surname, contact
user_props table has columns
propertyid, userid, property, value
in user_props you can insert user properties like
INSERT INTO user_props (userid, property, value)
VALUES (1, "date_of_birth", "2010-01-10"),(1, "hobby", "Stackoverflow")
This way you can dynamically set any number of properties for a user.
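Reading the properties back out is then a simple join (a sketch based on the tables above):
SELECT u.name, u.surname, p.property, p.value
FROM user u
JOIN user_props p ON p.userid = u.userid
WHERE u.userid = 1;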
You might be better off using MongoDB or some other NoSQL/Schemaless database which stores your data in key => value pairs. For the fields you are certain about in advance, you can set a type (in MongoDB) so those columns will have a schema. For dynamic fields, the fields would be stored as strings and you would have to figure out the types in your code somehow.
If you need to use MySQL, in your Employee table you could have a fourth column for custom fields - database type json. Then whenever you add a new custom field, you add the field_name, field_value and field_type. Your schema could look like:
//Schema for Employee table in mysql
id: int
name: varchar
surname: varchar
custom_fields: json //eg [ {"field_name": "DOB", "field_value": "06/09/2020", "field_type": "date"}, ... ]
//Schema for contacts table
id: int
employee_id: int
contact: varchar
In MySQL, you could also get rid of the type (if you can do without it) in the custom_fields and structure the json to be simple key => value pairs so it looks like
[
  {"key":"Age","value":"10"},
  {"key":"salary","value":"40,000"},
  {"key":"DOB","value":"06/09/2020"}
]
What you seem to be designing here is a variation of the Entity-Attribute-Value model. It works but it would be very cumbersome to query against a schema like that. Using a json column is a lot neater and a lot faster. Best is to use MongoDB and figure out the types in your code.
You may deal with this condition using a stored procedure that builds an ALTER TABLE statement. Something like:
DROP PROCEDURE IF EXISTS set_dynamic_table;
DELIMITER //
CREATE PROCEDURE set_dynamic_table (IN _field_name VARCHAR(50),
                                    IN _field_type VARCHAR(20),
                                    IN _last_field_name VARCHAR(50))
BEGIN
  SET @sql := CONCAT('ALTER TABLE dynamic_table ',
                     'ADD COLUMN ', _field_name, ' ', _field_type,
                     ' NULL AFTER ', _last_field_name);
  PREPARE _stmt FROM @sql;
  EXECUTE _stmt;
  DEALLOCATE PREPARE _stmt;
END//
DELIMITER ;
and you would then call the procedure with the new column's name, type, and the name of the column it should be added after.
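For example, a call might look like this (the column names here are only an illustration):
CALL set_dynamic_table('date_of_birth', 'DATE', 'contact_no');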

Is a semicolon delimiter a good way to store a large number of ID #s in a mysql field?

I have a database which will store millions of Post ID#s. I need to associate with each post ID # a number of User ID #s (on the order of about 20-50 for each post ID). I was thinking of constructing a semicolon delimited list in PHP and just inserting that into a DB field on the post ID row.
Is this a relatively efficient and good way to go about doing this?
Thanks!
The long answer to this is you need to create a one-to-many association table. Proper database normalization principles dictate this.
The problem with your approach (serializing the list into the database as a semicolon-concatenated list) is that the data itself is virtually useless unless you deserialize it.
Fields of this sort:
Cannot be indexed effectively.
Can grow to exceed the storage capacity of the column.
Require context to properly utilize.
Cannot work with foreign key integrity checking.
Cannot be easily amended.
Removing entries requires re-writing the entire field.
Cannot be queried directly.
Cannot be used in JOIN operations.
You're talking about creating a simple association table:
CREATE TABLE user_posts (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT,
post_id INT
)
You'd have a UNIQUE index on user_id,post_id to ensure that you don't have duplicates. The inclusion of an id column is mostly so you can remove particular rows without having to specify user+post pairs.
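Concretely, the unique index and a couple of typical lookups might look like this (a sketch, assuming the columns above):
ALTER TABLE user_posts ADD UNIQUE KEY uq_user_post (user_id, post_id);
# All users associated with post 42
SELECT user_id FROM user_posts WHERE post_id = 42;
# All posts for user 7
SELECT post_id FROM user_posts WHERE user_id = 7;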
No, this is a very bad idea.
A foreign key is what you want here. Basically, for every post_ID you also store the USER ID as a foreign key.
So, if you have a POSTS table, you add a column User_ID (or Poster_ID) and reference the USER ID in the USER table.
I think you should review some of the basics - please see links:
http://www.functionx.com/sql/Lesson11.htm
http://creately.com/blog/diagrams/er-diagrams-tutorial/
https://cs.uwaterloo.ca/~gweddell/cs348/errelational-handout.pdf

Decorating an existing relational SQL database with NoSql features

We have a relational database (MySql) with a table that stores "Whatever". This table has many fields that store properties of different (logical and data-) types. The request is that another 150 new, unrelated properties are to be added.
We certainly do not want to add 150 new columns. I see two other options:
Add a simple key-value table (ID, FK_Whatever, Key, Value and maybe Type) where FK_Whatever references the Whatever ID and Key would be the name of the property. Querying with JOIN would work.
Add a large text field to the Whatever table and serialize the 150 new properties into it (as Xml, maybe). That would, in a way, be the NoSql way of storing data. Querying those fields would mean implementing some smart full text statements.
Type safety is lost in both cases, but we don't really need that anyway.
I have a feeling that there is a smarter solution to this common problem (we cannot move to a NoSql database for various reasons). Does anyone have a hint?
In an earlier project where we needed to store arbitrary extended attributes for a business object, we created an extended schema as follows:
CREATE TABLE ext_fields
(
  systemId INT,
  fieldId INT,
  dataType INT  -- represented using an enum at the application layer.
  -- Other attributes.
);
CREATE TABLE request_ext
(
  systemId INT,   -- Composite primary key in the business object table.
  requestId INT,  -- Composite primary key in the business object table.
  fieldId INT,
  boolean_value BIT,
  integer_value INT,
  double_value REAL,
  string_value NVARCHAR(256),
  text_value NVARCHAR(MAX)
);
A given record will have only one of the _value columns set, based on the data type of the field as defined in the ext_fields table. This allowed us to retain the type of the field and its value, and it worked pretty well in utilizing all the filtering methods provided by the DBMS for those data types.
My two cents!
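For illustration, filtering on one of the typed values then reads like ordinary SQL (the field id here is hypothetical):
SELECT r.systemId, r.requestId, r.integer_value
FROM request_ext r
JOIN ext_fields f ON f.systemId = r.systemId AND f.fieldId = r.fieldId
WHERE f.fieldId = 7 AND r.integer_value > 100;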

Many to many relationships with large amount of different tables

I am having trouble developing a piece of my database schema. Currently, my app has a table of users, and a another table of events. I can easily set up a many to many relationship (using a third table) to hold information about which users are attending which events.
My problem is that events is just one feature of my app. The goal is to have a large number of different programs a user can take part in, and each will need its own table. Yet I still need to be able to call up a list of everything the user is signed up for.
Right now, I am thinking about just making one-way relationships from each event table back to the user. I would then need to create a custom function (in my website's ORM) that queries each table independently and assembles a full list. I feel like this would be slow, so I am also entertaining the idea of creating a separate table that just lists all the programs that users sign up for, and storing in it the info needed for my app to function. This would repeat info in my database, and in general doesn't sound as "clean", but it would probably be faster.
Any suggestions as to the best way to handle relationships like this?
P.S. If it matters, I'm using Doctrine2 & Symfony2 to power my site.
In one of my web applications, I have used this kind of construct for storing comments for any table that has an integer primary key:
CREATE TABLE Comments (
Table VARCHAR(24) NOT NULL,
RowID BIGINT NOT NULL,
Comments VARCHAR(2000) NOT NULL,
PRIMARY KEY (TABLE, RowID, COMMENTS)
);
In my case (DB2, less than 10 million rows in Comments table) it performs well.
So, applying it to your case:
CREATE TABLE Registration (
Table VARCHAR(24) NOT NULL,
RowID BIGINT NOT NULL,
User <datatype> NOT NULL,
Signup TIMESTAMP NOT NULL,
PRIMARY KEY (TABLE, RowID, User)
);
So, the 'Table' column identifies the table containing the program (say, the 'Events' table). 'RowID' is the primary key in that table (e.g. the PK of an entry in the 'Events' table). To perform well, this requires the primary key to be of the same datatype in all target tables.
NoSQL solutions are cool, but pattern above works in plain old relational database.
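Listing everything a user is signed up for is then a single query against the association table (a sketch; the backticks assume MySQL, where Table is a reserved word):
SELECT `Table`, RowID, Signup
FROM Registration
WHERE `User` = 123;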
What is unique about these event types that requires them to have their own table?
If the objects are so inherently different, make the object as simple as possible with only those things common to all Events:...
public class Event
{
public Guid Id;
public string Title;
public DateTime Date;
public string Type;
public string TypeSpecificData; // serialized JSON/XML
}
// Not derived from Event, but built from it.
public class SpecialEventType
{
public Guid Id;
// ... and the other common props from Event
// some kind of special prop parsed from the Event's serialized data
public string SpecialField;
}
The "type specific data" could then be used to store details about events that are not in common (that would normally require columns or new tables)... do it something like serialized XML or JSON
Map the table MTM to your Users table, and query by the basic event properties and its type.
Your code is then responsible for parsing the data using its Type property and some predefined XML schema you associate with it.
Very simple, keeps your database nice and clean, and fast, minimizes round trips. The tradeoff here is that you don't have the ability to query the DB for the specifics of a certain Event type... but for large scaling applications, with mature ORM layers, the performance tradeoff is worth it alone...
For example, now you query your data once for Events of a particular Type, build your pseudo-derived types from it, and then "query" them using LINQ.
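In table form, that design might look roughly like this (a sketch; the names are illustrative, and the serialized payload is just a text column):
CREATE TABLE events (
  id CHAR(36) PRIMARY KEY,       # GUID stored as text
  title VARCHAR(255),
  event_date DATETIME,
  event_type VARCHAR(50),
  type_specific_data TEXT        # serialized JSON/XML
);
# the many-to-many mapping to the Users table
CREATE TABLE user_events (
  user_id INT,
  event_id CHAR(36),
  PRIMARY KEY (user_id, event_id)
);
# Everything user 7 is signed up for, filtered by the basic type field
SELECT e.*
FROM events e
JOIN user_events ue ON ue.event_id = e.id
WHERE ue.user_id = 7 AND e.event_type = 'SpecialEventType';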
Unless you have a ridiculous amount of types of events, querying events a user is signed up for from a few tables should not be much slower than querying the same thing from one long table of all the events.
I would take this approach: each table or collection has a user_id field which maps back to the Users table. You don't really need to create a separate function in the ORM. If each of the event types inherits from an event class, then you can just find all events by user_id.

How to store data with dynamic number of attributes in a database

I have a number of different objects with a varying number of attributes. Until now I have saved the data in XML files which easily allow for an ever changing number of attributes. But I am trying to move it to a database.
What would be your preferred way to store this data?
A few strategies I have identified so far:
Having one single field named "attributes" in the object's table and store the data serialized or json'ed in there.
Storing the data in two tables (objects, attributes) and using a third to save the relations, making it a true n:m relation. Very clean solution, but possibly very expensive to fetch an entire object and all its attributes
Identifying attributes all objects have in common and creating fields for these to the object's table. Store the remaining attributes as serialized data in another field. This has an advantage over the first strategy, making searches easier.
Any ideas?
If you ever plan on searching for specific attributes, it's a bad idea to serialize them into a single column, since you'll have to use per-row functions to get the information out - this rarely scales well.
I would opt for your second choice. Have a list of attributes in an attribute table, the objects in their own table, and a many-to-many relationship table called object_attributes.
For example:
objects:
object_id integer
object_name varchar(20)
primary key (object_id)
attributes:
attr_id integer
attr_name varchar(20)
primary key (attr_id)
object_attributes:
object_id integer references (objects.object_id)
attr_id integer references (attributes.attr_id)
oa_value varchar(20)
primary key (object_id,attr_id)
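Pulling an object together with all of its attributes is then a straightforward join (a sketch using the names above):
SELECT o.object_name, a.attr_name, oa.oa_value
FROM objects o
JOIN object_attributes oa ON oa.object_id = o.object_id
JOIN attributes a ON a.attr_id = oa.attr_id
WHERE o.object_id = 1;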
Your concern about performance is noted but, in my experience, it's always more costly to split a column than to combine multiple columns. If it turns out that there are performance problems, it's perfectly acceptable to break 3NF for performance reasons.
In that case I would store it the same way but also have a column with the raw serialized data. Provided you use insert/update triggers to keep the columnar and combined data in sync, you won't have any problems. But you shouldn't worry about that until an actual problem surfaces.
By using those triggers, you minimize the work required to only when the data changes. By trying to extract sub-column information, you do unnecessary work on every select.
A variation on your 2nd solution is just two tables (assuming all attributes are of a single type):
T1: |Object data columns|Object_id|
T2: |Object id|attribute_name|attribute value| (unique index on first 2 columns)
This is even more efficient when combined with 3rd solution, e.g. all of the common fields go into T1.
Stuffing more than one attribute into the same blob is not recommended: you cannot filter by attributes, and you cannot efficiently update them.
Let me give some concreteness to what DVK was saying.
Assuming the values are of the same type, the table would look like this (good luck, I feel you're going to need it):
dynamic_attribute_table
------------------------
id NUMBER
key VARCHAR
value SOMETYPE?
example (cars):
|id| key | value |
---------------------------
| 1|'Make' |'Ford' |
| 1|'Model' |'Edge' |
| 1|'Color' |'Blue' |
| 2|'Make' |'Chevrolet'|
| 2|'Model' |'Malibu' |
| 2|'MaxSpeed'|'110mph' |
Thus,
entity 1 = { ('Make', 'Ford'), ('Model', 'Edge'), ('Color', 'Blue') }
and,
entity 2 = { ('Make', 'Chevrolet'), ('Model', 'Malibu'), ('MaxSpeed', '110mph') }.
If you are using a relational db, then I think you did a good job listing the options. They each have their pros and cons. YOU are in the best position to decide what works best for your circumstances.
The serialized approach is probably the fastest (depending on your code for de-serializing), but it means that you won't be able to query the data with SQL. If you say that you don't need to query the data with SQL, then I agree with @longneck: maybe you should use a key/value style db instead of a relational db.
EDIT: reading more of your comments, WHY are you switching to a db if speed is your main concern? What's BAD about your current XML implementation?
I used to implement this scheme:
t_class (id RAW(16), parent RAW(16)) -- holds class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- hold GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties
t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.
RAW(16) is where Oracle holds GUIDs
If you want to select all properties for an object, you issue:
SELECT i.*
FROM (
SELECT id
FROM t_class
START WITH
id = (SELECT class FROM t_declaration WHERE id = :object_id)
CONNECT BY
parent = PRIOR id
) c
JOIN t_property p
ON p.class = c.id
LEFT JOIN
t_instance i
ON i.id = :object_id
AND i.class = p.class
AND i.property = p.property
t_property holds stuff you normally don't search on (like text descriptions, etc.)
Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.
You don't have to use fast tables and limit all your data to these four tables.
Sounds like you need something like CouchDB, not an RDBMS.
If you are going to edit/manipulate/delete the attributes at a later point, a true n:m relation (your second option) is the one I would go for. (Or try to make it two tables where the same attribute name repeats, but the data size will be higher.)
If you are not dealing with the attributes (just capturing and showing the data), then you can go ahead and store them in one field with some separator (make sure the separator won't occur in the attribute values).
I am assuming you do not have digital attribute soup, but that there is some order to your data.
Otherwise, an RDBMS might not be the best fit. Something along the lines of NoSQL might work better.
If your objects are of different types, you should generally have one table per type.
Especially if you want to connect them using primary keys. It also helps to bring order and sanity if you have Products, Orders, Customers, etc tables, instead of just an Object and Attribute table.
Then look at your attributes. Anything that exists for more than, say, 50% of the objects in that type category, make it a column in the object's table and use NULL when it's not being used.
Anything that is mandatory, should, of course, be defined as a NOT NULL column.
The rest, you can either have one or several "extra attributes" tables for.
You could put the attribute names into the table with the values, or normalize them out in a separate table and only use the primary key in the value table.
You may also find that you have combinations of data. For instance, a variant of an object type always has a certain set of attributes while another variant of the same object type has another set of attributes.
In that case, you might want to do something like:
MainObjectTable:
mainObjectId: PRIMARY KEY
columns...
MainObjectVariant1Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant1Columns...
MainObjectVariant2Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant2Columns...
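In SQL terms, a variant is then retrieved with a single join back to the main table (a sketch; the names follow the outline above):
SELECT m.*, v.*
FROM MainObjectTable m
JOIN MainObjectVariant1Table v ON v.mainObjectId = m.mainObjectId
WHERE m.mainObjectId = 1;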
I think the hard work, that will pay off, in the long run, is to analyze the data, find the objects and the commonly used attributes and make it into a good "object/ERD/DB" model.