How to normalize a live database - MySQL

I need to normalize an existing data structure. I have one table with lots of redundant data (42 columns).
A few examples:
files_shit (id, filename String, upload_user, user_name, tags text, ....)
and I want to split it into three tables: files, users and tags.
I have almost 30,000 records.
What is the best way to copy data from files_shit to files, users and tags and create the references? (Between tags and files there will be another junction table, file_tags.)

First, you cannot convert this table in place. You will have to create new ones. A simple approach is to use the existing table as a staging table: create the new tables, then select from the staging table and insert into them.
You will have to identify the primary key for each table. Then fill the tables (you may have to work out which table to fill first for reasons of referential integrity, etc.).
Pseudocode example:
INSERT INTO files (columns...) SELECT <file columns> FROM files_shit GROUP BY primary_column;
(Note: this means you will use the natural primary column(s) as the primary key. If you want to use auto-generated integers (usually optimal), you will have to perform lookups...)
A lot depends on the new schema and relations (which you haven't defined clearly here). Hope this helps.
EDIT - Lookups
You will have an INT id field for each table, e.g. file_id. These will be system generated (usually AUTO_INCREMENT). In other words, this information is not in your current table. So when you add a file to the files table and it gets a file_id, you will have to 'look up' that id to populate the foreign key columns in the related tables (based on how those relationships exist).
SIMPLE EXAMPLE -
Try adding additional file_id/tag_id columns to your main (staging) table.
Fill the tag table first (basically the tables that don't reference any other).
Fill the main table's tag_id for each row by joining against the tag table (the lookup):
UPDATE <mainTable> mT JOIN tag_table tT ON mT.tag_pk_column = tT.tag_pk_column
SET mT.tag_id = tT.tag_id;
Now: INSERT INTO files ... SELECT file_pk_col, tag_id ... GROUP BY file_pk_col;
This is an example lookup for the tag table.
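A minimal end-to-end sketch of this lookup pattern, with hypothetical column names (user_name, filename) standing in for your real 42 columns:

-- Staging table is files_shit itself; add a column to hold the looked-up id.
ALTER TABLE files_shit ADD COLUMN new_user_id INT;

-- 1. Fill users from the distinct users in the staging table.
INSERT INTO users (user_name)
SELECT DISTINCT user_name FROM files_shit;

-- 2. Look up the generated ids and write them back (the lookup step).
UPDATE files_shit fs
JOIN users u ON u.user_name = fs.user_name
SET fs.new_user_id = u.user_id;

-- 3. Fill files, carrying the looked-up user id along.
INSERT INTO files (filename, user_id)
SELECT filename, new_user_id
FROM files_shit
GROUP BY filename, new_user_id;

-- 4. Repeat the same pattern for tags and the file_tags junction table
--    (after splitting the tags text column into one row per tag).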

The simplest way is to take the database offline, create the new tables, including all the required constraints, and use INSERT INTO ... SELECT column_list FROM old_table to populate them. Some data probably won't satisfy the constraints in the new tables; you'll have to fix that.
It gets more complicated if you can't take the database offline, or if you have to make the changes transparent to application programs. Triggers, rules, and updatable views can help with that.
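For instance, a small sketch of the offline copy, assuming hypothetical old_files/files tables and a NOT NULL constraint that some old rows violate:

-- Copy only the rows that already satisfy the new table's constraints.
INSERT INTO files (id, filename, uploaded_at)
SELECT id, filename, uploaded_at
FROM old_files
WHERE filename IS NOT NULL;    -- the new table declares filename NOT NULL

-- Review whatever was held back, fix it, then copy that too.
SELECT * FROM old_files WHERE filename IS NULL;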

Related

Add column to 50M+ records table? Most efficient way?

I have a table products that has over 50M records. I want to track who uploaded a given product in my system, but simply adding uploaded_by_id to such a huge table isn't the solution I'm looking for. What, other than a join table, can I create to be able to query for products uploaded by a given id in a given time range?
Product.where(uploaded_by_id: @user.id, created_at: time_range) is what I need to do, but I need a more efficient way.
You might want to look into tools like
SoundCloud's Large Hadron Migrator or
Percona's pt-online-schema-change.
Both tools allow altering tables without locking them.
Instead of touching the main table, add another table (vertical partitioning). The new table would have the same PRIMARY KEY, but without AUTO_INCREMENT. The new column(s) would go into this table; a sketch follows below.
Create rows in the new table only when the new column(s) have a value.
When you don't need the new column(s), continue to read only from the old table.
When you also need the new column(s), use LEFT JOIN.
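A minimal sketch of this vertical-partitioning layout, with invented names (products is the 50M-row table, product_uploads the side table):

-- Side table: same PRIMARY KEY as products, but no AUTO_INCREMENT.
CREATE TABLE product_uploads (
    product_id     INT NOT NULL PRIMARY KEY,   -- equals products.id
    uploaded_by_id INT NOT NULL,
    created_at     DATETIME NOT NULL,
    KEY idx_by_uploader (uploaded_by_id, created_at)
);

-- The "who uploaded in this time range" question only needs the side table:
SELECT product_id
FROM product_uploads
WHERE uploaded_by_id = 42
  AND created_at >= '2024-01-01' AND created_at < '2024-02-01';

-- When a full product row plus the new columns is needed, LEFT JOIN:
SELECT p.*, pu.uploaded_by_id
FROM products p
LEFT JOIN product_uploads pu ON pu.product_id = p.id;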

Database table setup: Multiple tables that serve the same purpose?

I need to set up a MySQL database for a bug tracker that's paired with a changelog.
I essentially have four tables: product, version, problem, problem_solution. The reason I split problems and their solutions is that I want to be able to provide my users with a set of possible solutions.
Now I want to add attachments to each of these tables and manage them via the database as well. There should be pictures, PDFs, ... for each product, version, and possibly for each problem and solution.
Would I rather
Create 4 attachment-tables (product_attachments, version_attachments, ...), or
Create one attachment-table and create a column stating what it is for?
If the latter, how should I do it? I want to reference the specific ID of the product, version, problem or solution using a foreign key. Should I then just create 4 columns, each with a foreign key, and decide whether it's an attachment for a product, a version, ... depending on which of these columns is not NULL? Wouldn't this make my queries unnecessarily complex?
I say create one table, have its primary key available, and create another, EAV-style table for the many-to-many relation between attachments and the other entities: "value" corresponds to the attachment ID, "entity" to the foreign ID, and "attribute" to a value out of a fixed set of product, version, problem, solution, in any form you like (1, 2, 3, 4?). This way the attachments are stored in a table with an (id, blob) structure, perhaps with a count column storing the number of links in the relation table, so that an orphaned attachment can be detected and removed with ease.
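A hedged sketch of that layout; every table and column name here is invented:

CREATE TABLE attachments (
    attachment_id INT AUTO_INCREMENT PRIMARY KEY,
    data          LONGBLOB NOT NULL,
    link_count    INT NOT NULL DEFAULT 0   -- optional: spot orphaned attachments
);

-- EAV-style relation table. entity_id cannot carry a real FOREIGN KEY
-- because its target table varies with entity_type.
CREATE TABLE attachment_links (
    attachment_id INT NOT NULL,
    entity_type   TINYINT NOT NULL,   -- 1=product, 2=version, 3=problem, 4=solution
    entity_id     INT NOT NULL,
    PRIMARY KEY (attachment_id, entity_type, entity_id),
    FOREIGN KEY (attachment_id) REFERENCES attachments (attachment_id)
);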

Database design - which would be better?

I have multiple tables.
They all have the following fields in them:
item_title | item_description | item_thumbnail | item_keywords
Would I be better off having a single items_table with an extra item_type field and then joining with the respective table, or just keep them all in separate tables?
Depends on the context. If your items have very little differentiation and you're certain you won't have a scenario in 6 months, 12 months, or 2 years where you need items separated, then go the route of one generic "items" table. If a particular item type does have specific requirements, you can create a separate table that contains this data and use a LEFT JOIN when querying to include the extra data.
I'd also suggest looking at other database types. Judging from your scenario (lots of item types with little variance in the data stored), you may benefit from a document-oriented database engine like MongoDB rather than a relational database engine like MySQL.
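If you do go the generic-table route, a small sketch of the items table plus a type-specific extension, with invented names (video_details stands in for a type with extra requirements):

CREATE TABLE items (
    item_id          INT AUTO_INCREMENT PRIMARY KEY,
    item_type        VARCHAR(20) NOT NULL,   -- e.g. 'article', 'video'
    item_title       VARCHAR(255) NOT NULL,
    item_description TEXT,
    item_thumbnail   VARCHAR(255),
    item_keywords    TEXT
);

-- Columns only one type needs go in an extension table.
CREATE TABLE video_details (
    item_id          INT PRIMARY KEY,
    duration_seconds INT NOT NULL,
    FOREIGN KEY (item_id) REFERENCES items (item_id)
);

-- LEFT JOIN pulls in the extra data where it exists.
SELECT i.item_title, v.duration_seconds
FROM items i
LEFT JOIN video_details v ON v.item_id = i.item_id;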
OK, so the tables share fields. Do they also share constraints¹?
If yes, then go ahead and merge them together.
If not, you may keep them separate, or may merge them together, depending on what kind of tradeoff you are willing to make.
For example, if tables have separate foreign keys, you may keep them separate, or you may merge them into a single table, but keep FKs separate:
item_title
item_description
item_thumbnail
item_keywords
table1_id REFERENCES table1 (table1_id)
table2_id REFERENCES table2 (table2_id)
...
CHECK (
(table1_id IS NOT NULL AND table2_id IS NULL ...)
OR (table1_id IS NULL AND table2_id IS NOT NULL ...)
...
)
(NOTE: MySQL before version 8.0.16 parses but does not enforce CHECK, so you'll need to do the equivalent enforcement from a trigger or client code, or use a different DBMS if you can.)
I'd need to know more about your database to figure out which is better.
with an extra item_type field and then joining with the respective table,
Never enforce FKs in code, if you can help it. Even if you merge the tables together, don't merge FKs, instead do something like the above. Enforcing FKs in code in the context of the concurrent environment (where multiple clients can try to modify the same data at the same time) is difficult to do correctly and with good performance - it's much better to let the DBMS do it for you.
BTW, what is item_keywords? If it's a comma-separated list of keywords (or similar), you'll need to normalize further and extract the keywords into their own separate table.
¹ Domain (data type and CHECK), key (PRIMARY KEY and UNIQUE) and referential (FOREIGN KEY) constraints.
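As for the item_keywords aside above, a minimal sketch of that extra normalization step, assuming an items table with an item_id key:

CREATE TABLE keywords (
    keyword_id INT AUTO_INCREMENT PRIMARY KEY,
    keyword    VARCHAR(100) NOT NULL UNIQUE
);

-- Junction table: one row per (item, keyword) pair.
CREATE TABLE item_keywords (
    item_id    INT NOT NULL,
    keyword_id INT NOT NULL,
    PRIMARY KEY (item_id, keyword_id),
    FOREIGN KEY (item_id)    REFERENCES items (item_id),
    FOREIGN KEY (keyword_id) REFERENCES keywords (keyword_id)
);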
I believe it is good to have as few tables as possible; it is easier to maintain. It is hard to imagine having 3,000 values of item_type: there would be 3,000 different tables. So a single table is a good idea in your case. In the future, when you run into a situation where you need to separate the tables, you can do so easily.
So the short answer: yes.
If I understand correctly, you only need to normalize your schema:
items:
    item_id
    item_name
    item_description
items_types:
    item_id
    type_id
types:
    type_id
    item_file_name
This way you can have any number of items with any number of types.
Is this what you want to do?
I would suggest using one table for items and one table for types, for the following reasons (assume there are 10 types):
I am not sure which programming language you are using. As a Java developer, I would have to create an entity class for each type if I had multiple tables, so I would rather have only one class with a type attribute.
When you have to display all of the types on the same page, you would have to run SELECT queries against all 10 tables for the 10 types.
When you introduce a new type, you would have to write code for the CRUD and business-specific operations, and the developer would keep adding code for every new type.
Basically, with one table for items and one table for types, you won't have to change the database schema and code for each new type you introduce. But if you are sure that the number of types is small and won't change, you can consider using multiple tables.
Create two separate tables and join them for your required output, i.e.:
1st table (master table): item_type (item_type_id, item_type_name, status)
2nd table (child table): item_details (item_id, item_type_id, item_title, item_description, item_thumbnail, item_keywords)
I feel a single table would be more suitable. It will avoid extra joins, complication in the program code, and errors, compared to multiple tables. It is also better from a management point of view, e.g. DB clustering.
If you have many tables which need the same repeated columns, then yes, it is a good idea to create a separate table for the common fields. This is more flexible when those repeated columns are not fixed and can change, like adding one more column to the list of common default columns.
So how could you do that?
The idea is to create a separate table and put the common default columns there.
This table acts like a template table, i.e. the columns can be added/deleted as needed.
For example-
Table - DefaultFields
Columns - item_title | item_description | item_thumbnail | item_keywords
You can then insert the values into the DefaultFields table dynamically in a loop, e.g.:
"INSERT INTO DefaultFields (item_table, item_title, item_description, item_thumbnail, item_keywords) VALUES ('" + field.item_table + "', '" + field.item_title + "', '" + field.item_description + "', '" + field.item_thumbnail + "', '" + field.item_keywords + "')"
NOTE: field is the object that holds the values in a table-wise loop.
Then you can alter your tables to add these default fields from the DefaultFields table, e.g.:
"ALTER TABLE " + item_table + " ADD COLUMN `" + field.item_title + "` TEXT"
This can be repeated for each table to alter it as needed.
In this design pattern, even if you want to:
1) add one more column,
2) delete a pre-existing column, or
3) change a pre-existing column name,
you can do so in the template table, and the rest is updated by the ALTER TABLE command in the corresponding tables.
In my opinion... I would say no, never.
There are two reasons for that:
You really want to preserve logical meaning in your database. For now it's pretty obvious to you how it's organised, but in two months (or a year), will it be so evident? If somebody joins the project, isn't it easier for them to understand if the different logical blocks of your app are separated? It's true that a human and a cat are both animals, but is it still logical to store both of them inside the same box?
Performance. The fewer rows a table has, the faster your queries will be. The data will still take just as much space on disk, and that's before counting the comparison needed to find which type of item you are looking for. If you want to select all the pages of your application, just compare the two queries:
Multiple tables:
SELECT * FROM pages_tbl;
Single table:
SELECT * FROM item_tbl WHERE type = 'page';
What will you gain from this design? No performance, no disk space, no readability. I really don't see a good reason for it.

Insert a new column in SQL

I have a DB consisting of 4 fields. My application will retrieve data from that DB. I have one primary key (the id). Depending on the id, I also want to provide other data that will be organized in a new table. What is better: create a new table and search it separately, or, given that I have already found the row by its id, create a new element that behaves like a table? For example, can I create a new element named info and make it something like an array, since I want 11 rows and 2 columns for the info? My SQL code so far is this:
CREATE TABLE people (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    sex BOOL NOT NULL DEFAULT 1,
    birthyear INT NOT NULL
);
What changes do I need to make? This table is already created.
If each row in the existing table now also needs associating with an 11x2 set of data, you're best off creating another table.
Don't try to stuff 22 items of data into a single field, it's a really bad idea.
If, however, it's always the same shape (22 items), you could just add 22 fields. It depends on how that data is going to be used, searched, joined on, etc.
Exactly how to do that depends on your RDBMS and your interface to it. It may be easier to create a whole new table and copy the old data across, or your environment may allow you to add the columns and do the legwork for you.
I think it would be best to create a separate new table to contain the additional data, primarily because you will have more than one record per ID from the original table.
The records in the new table would have a foreign key peopleID field linking them to the people table.
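A minimal sketch of such a table; the info_key/info_value columns are invented stand-ins for the 11x2 data:

CREATE TABLE people_info (
    info_id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    peopleID   INT NOT NULL,              -- links back to people.id
    info_key   VARCHAR(100) NOT NULL,     -- first of the two columns
    info_value VARCHAR(255) NOT NULL,     -- second of the two columns
    FOREIGN KEY (peopleID) REFERENCES people (id)
);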
I believe you are hinting at embedding tables, which isn't really what MySQL is meant to do. Instead, do the following: create a table like the one in your example, then create a new table that has a column for an ID (the same as in the people table) and the other various columns. You can then do an inner join to join the two together. Additionally, if you want to reference different tables for different rows, you may want to add a column for what 'type' it is.
Alternatively, you could use a 'NoSQL' solution like Mongo, which lets you add things dynamically. But I wouldn't suggest doing that until you have a decent grasp of relational databases.

Different database tables joining on single table

So imagine you have multiple tables in your database, each with its own structure and each with a PRIMARY KEY of its own.
Now you want a Favorites table so that users can add items as favorites. Since there are multiple tables, the first thing that comes to mind is to create one favorites table per table:
Say you have a table called Posts with PRIMARY KEY (post_id); you would create a Post_Favorites with PRIMARY KEY (user_id, post_id).
This would probably be the simplest solution, but could it be possible to have one Favorites table joining across multiple tables?
I've thought of the following as a possible solution:
Create a new table called Master with primary key (master_id). Add insert triggers on all tables in your database that generate a new master_id and write it along with the row in your table. Also record in the Master table where (in which table) each master_id has been used.
Now you can have one Favorites table with PRIMARY KEY (user_id, master_id)
You can select from the Favorites table and join each individual table on the master_id to get the favorites per table. But would it be possible to get all the favorites with one query (maybe not a query, but a stored procedure)?
Do you think this is a stupid approach? Since you would perform one query per table, what are you gaining by having a single table?
What are your thoughts on the matter?
One way would be to sub-type all possible tables to a generic super-type (Entity) and then link user preferences to that super-type.
I think you're on the right track, but a table-based inheritance approach would be great here:
Create a table master_ids, with just one column: an int-identity primary key field called master_id.
On your other tables (users as an example), change the user_id column from being an int-identity primary key to being just an int primary key. Next, make user_id a foreign key to master_ids.master_id.
This largely preserves data integrity. The only place you can trip up is if you have master_id = 1 with both a user_id = 1 and a post_id = 1. For a given master_id you should have only one entry across all tables; in that scenario you have no way of knowing whether master_id 1 refers to the user or to the post. A way to make sure this doesn't happen is to add a second column to the master_ids table, a type_id column: type_id 1 can refer to users, type_id 2 to posts, and so on. Then you are pretty much good.
Some code "gymnastics" may be necessary for inserts. If you're using a good ORM, it shouldn't be a problem; if not, stored procs for inserts are the way to go. But you get to have your cake and eat it too.
I'm not sure I really understand the alternative you propose.
But in general, when given the choice of 1) "more tables" or 2) "a mega-table supported by a bunch of fancy code work", your interests are best served by more tables without the code gymnastics.
A red flag was "Add triggers on all tables in your database": each trigger fire is a performance hit of its own.
The database designers have built in all kinds of technology to optimize tables/indexes, much of it behind the scenes without you knowing it. Just sit back and enjoy the ride.
Try these for inspiration: Database Answers (no affiliation to me).
An alternative to your approach might be to have the favorites table as (user_id, object_id, object_type). When inserting into the favorites table, just record the type of the favorite. However, I don't see a simple query being able to work with either your approach or mine. One way to go about it might be to use UNION to get one combined result set and then identify what type each record is based on the type column. Another thing you can do is turn the UNION query into a MySQL VIEW and simply query that VIEW.
The benefit of using a single table for favorites is simplicity, which some might consider to be against database normalization rules. But on the upside, you don't have to create so many favorites tables, and you can add anything to favorites easily by just coming up with a new object_type identifier.
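A sketch of that UNION view, assuming hypothetical posts and photos tables with title/caption columns:

-- favorites(user_id, object_id, object_type) as described above.
CREATE VIEW favoritable AS
    SELECT post_id  AS object_id, 'post'  AS object_type, title   AS name FROM posts
    UNION ALL
    SELECT photo_id AS object_id, 'photo' AS object_type, caption AS name FROM photos;

-- One query resolves every favorite, whatever its type.
SELECT f.user_id, v.object_type, v.name
FROM favorites f
JOIN favoritable v
  ON v.object_id = f.object_id AND v.object_type = f.object_type;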
It sounds like you have an is-a relationship that needs to be modeled: all of the items that can be favourited are a type of "item". You are on the right track, but I wouldn't use triggers.
What could be the right answer, if I have understood correctly, is to pull all the common fields into a single table called items ("master" is a poor name; master of what?). This should include all the common data needed when you fetch a user's favourite items; I'd expect fields like item_id (primary key), item_type and human_readable_name, and maybe some metadata about when the item was created, modified, etc. Each of your specific item types would have its own table containing data specific to that item type, with an item_id field that has a foreign key relationship to the items table. Then you'd wrap each item type in its own insertion, update and selection SPs (i.e. InsertItemCheese, UpdateItemMonkey, SelectItemCarKeys). The favourites table would then work as you describe, but you only need to select from the items table. If your app needs the specific data for each item type, it would have to be queried per item (caching is your friend here).
If MySQL supports SPs with multiple result sets, you could write one that outputs all the items as one result set, then a result set per item type, if you need all the specific item data in one go. For most cases I would not expect you to need all the data all the time.
Keep in mind that not EVERY use of a PK column needs a constraint. For example, a logging table: even though it holds a copy of the PK column from the table being logged, you can't build a constraint on it.
What would be the worst possible case? You insert a record for Oprah's TV show into the favorites table, and then next year you delete the Oprah show from the list of TV shows but don't delete that ID from the favorites table. Will that break anything? Probably not: when you join favorites to TV shows, that record simply falls out of the result set.
There are a couple of ways to share values for PKs. Oracle has the advantage of sequences; if you don't have those, you can add a "step" to your auto-number fields. There's always a risk, though.
Say you think you'll never have more than 10 tables of "things which could be favored". Then start your PKs at 0 for the first table and increment by 10, at 1 for the second table (also incrementing by 10), at 2 for the third, and so on. That guarantees the values will be unique across those 10 tables. The risk is that a future requirement adds table 11; you can always 'pad' your guesstimate.
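In MySQL this "step" trick maps onto the auto_increment_increment and auto_increment_offset variables; a hedged sketch (table names invented):

-- Session variables; they affect every AUTO_INCREMENT insert in the session,
-- so set the offset appropriately before loading each table.
SET @@auto_increment_increment = 10;   -- the "step"
SET @@auto_increment_offset    = 1;    -- table 1 gets ids 1, 11, 21, ...
INSERT INTO tv_shows (name) VALUES ('Oprah');

SET @@auto_increment_offset    = 2;    -- table 2 gets ids 2, 12, 22, ...
INSERT INTO posts (title) VALUES ('hello');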