How to normalize category database [closed] - mysql

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've got a MySQL DB with ~5000 rows. It occurred to me that before adding any more records or columns I need to get the database normalized. But even with all the research I've done, I'm struggling to understand several concepts. Here is an example of my existing database schema.
PK Category1 Category2 Category3 Produce_String Keywords Zip State City Country
1 Vegetable Potato f5kkfid34fbn organic ... ..... ... .....
2 Vegetable Potato plf85jfuvj organic,fresh ... ..... ... .....
3 Vegetable Cherry Tomato jf9vmu37jg9 fresh
4 Fruit Lemon kfkt8hkf0e fresh,yellow
5 Fruit Lemon fkg8rr03gnf
6 Fruit Red Apple fkf9gkty367r6 crispy
My main misunderstanding is how to relate the data to one another once the columns are separated into individual tables. For example, in a DB client I can see the rows and how they relate to one another, but if I separate them this will no longer be the case. I am also concerned about having to update multiple tables for the same record, but I suppose this is unavoidable.
Also, I'm not clear on the proper way to normalize this. My mind tells me to separate only the Keywords column, since it's the only column that has comma-separated entries. But by normalization standards I believe I need to separate the categories, keywords, and location.
EDIT
Another concern I have is that if I put the categories in a separate table, each with its own row, I lose the structure, and with it the ability to sort by the specific categories. For example, the vegetable category would not be related to a fruit. Since the Produce_String is unique, could I use it as the foreign key in the other tables?

You can have separate Category and Keywords tables
category keyword
------------------------- -------------------------
id | name id | name
1 | Vegetable 1 | organic
2 | Potato 2 | fresh
Then make two additional tables for many-to-many relations:
category_to_product keyword_to_product
------------------------- -------------------------
category_id | product_id keyword_id | product_id
1 | 1 1 | 1
1 | 2 1 | 2
2 | 1 2 | 2
And to update categories for product #1:
DELETE FROM `category_to_product` WHERE `product_id` = :product_id;
INSERT INTO `category_to_product` (`category_id`, `product_id`) VALUES (1, 1), (2, 1), (8, 1);
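As a rough sketch of the delete-then-insert update above, here is how it might look when wrapped in a transaction so a product is never left without its categories halfway through. This uses Python's built-in sqlite3 module as a stand-in for MySQL; the `replace_categories` helper and the name 'Organic' for category 8 are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE category_to_product (
        category_id INTEGER NOT NULL REFERENCES category(id),
        product_id  INTEGER NOT NULL,
        PRIMARY KEY (category_id, product_id)
    );
    -- 'Organic' is a hypothetical name for category 8.
    INSERT INTO category (id, name) VALUES (1, 'Vegetable'), (2, 'Potato'), (8, 'Organic');
    INSERT INTO category_to_product VALUES (1, 1), (1, 2), (2, 1);
""")

def replace_categories(conn, product_id, category_ids):
    """Swap a product's category links atomically (delete, then re-insert)."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM category_to_product WHERE product_id = ?", (product_id,)
        )
        conn.executemany(
            "INSERT INTO category_to_product (category_id, product_id) VALUES (?, ?)",
            [(cid, product_id) for cid in category_ids],
        )

replace_categories(conn, 1, [1, 2, 8])

# Read the links back through a join to get names instead of ids.
product_1_categories = [row[0] for row in conn.execute(
    "SELECT c.name FROM category c "
    "JOIN category_to_product cp ON cp.category_id = c.id "
    "WHERE cp.product_id = ? ORDER BY c.id", (1,))]
product_2_links = conn.execute(
    "SELECT COUNT(*) FROM category_to_product WHERE product_id = 2").fetchone()[0]
print(product_1_categories, product_2_links)
```

Only the rows for the given product are touched; the links of product #2 stay as they were.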

This is a question with a potentially long answer, and you might be better off starting with somewhere like Wikipedia or similar. But in a nutshell:
Normalisation usually solves more problems than it causes. Consider someone who spelt "Vegetable" as "Vegteable" for one of the rows in your example, or consider if you wanted to add a fourth category. Or what about if you wanted to change all instances of the category "Baby Marrow" to "Zucchini"? You are correct that both of these could be implemented as separate tables.
One of the criteria that you can use for deciding whether to normalise or not is to think about where you want to control the integrity of the data. You may be in control of application code now that ensures that the category names are always used consistently, but it's hard to foresee what applications will come in the future. Keeping the list of categories in its own table makes sure that when you link two products to the same category, they are indeed linked to the same category (i.e. the single row "Vegetable" in the category table). When you change a category, you can change it in one place. You can easily find all products linked to a single category before you delete a category, and so on.
Yes, once the data is in separate tables, you don't see it all in one row any longer, but joining the data is what relational data is all about, and you can probably use database views to recreate the layout that you've shown us from the underlying normalised data. It is very normal to join several tables in an SQL select statement.
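To make the point about views concrete, here is a minimal sketch, again using Python's sqlite3 as a stand-in for MySQL. The table layout is an assumption (one category per product, for brevity), and `product_flat` is a made-up view name. It shows a misspelt category being fixed in one place and the flat, joined layout picking up the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL REFERENCES category(id),
        name TEXT NOT NULL,
        produce_string TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE keyword_to_product (
        keyword_id INTEGER NOT NULL REFERENCES keyword(id),
        product_id INTEGER NOT NULL REFERENCES product(id),
        PRIMARY KEY (keyword_id, product_id)
    );
    INSERT INTO category VALUES (1, 'Vegteable'), (2, 'Fruit');  -- note the typo
    INSERT INTO product VALUES (1, 1, 'Potato', 'f5kkfid34fbn'),
                               (4, 2, 'Lemon', 'kfkt8hkf0e');
    INSERT INTO keyword VALUES (1, 'organic'), (2, 'fresh'), (3, 'yellow');
    INSERT INTO keyword_to_product VALUES (1, 1), (2, 4), (3, 4);

    -- The view rebuilds the original one-row-per-product layout from the joins.
    CREATE VIEW product_flat AS
    SELECT p.id, c.name AS category, p.name AS product, p.produce_string,
           group_concat(k.name, ',') AS keywords
    FROM product p
    JOIN category c ON c.id = p.category_id
    LEFT JOIN keyword_to_product kp ON kp.product_id = p.id
    LEFT JOIN keyword k ON k.id = kp.keyword_id
    GROUP BY p.id, c.name, p.name, p.produce_string;
""")

# Fix the spelling once; every product linked to the category follows.
with conn:
    conn.execute("UPDATE category SET name = 'Vegetable' WHERE name = 'Vegteable'")

rows = conn.execute(
    "SELECT category, product, keywords FROM product_flat ORDER BY id").fetchall()
for row in rows:
    print(row)
```

The order of keywords inside the concatenated column is not guaranteed here; MySQL's GROUP_CONCAT accepts an ORDER BY inside the call if a stable order matters.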

Related

MYSQL DB Best method to store keywords and URL index

Which of these methods would be the most efficient way of storing, retrieving, processing and searching a large (millions of records) index of stored URLs along with their keywords?
Example 1: (Using one table)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com videos,photos,images
2 yoursite.com videos,games
3 hissite.com games,images
4 hersite.com photos,pictures
---------------------------------------------------------
Example 2: (one-to-one Relationship from one table to another)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_KEYWORDS---------------------------------------------
ID DOMAIN_ID KEYWORDS
1 1 videos,photos,images
2 2 videos,games
3 3 games,images
4 4 photos,pictures
---------------------------------------------------------
Example 3: (one-to-one Relationship from one table to another (Using a reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 2 2
3 3 3
4 4 4
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos,photos,images
2 videos,games
3 games,images
4 photos,pictures
---------------------------------------------------------
Example 4: (many-to-many Relationship from url to keyword ID (using reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 1 2
3 1 3
4 2 1
5 2 4
6 3 4
7 3 3
8 4 2
9 4 5
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos
2 photos
3 images
4 games
5 pictures
---------------------------------------------------------
My understanding is that Example 1 would take the largest amount of storage space, however searching through this data would be quick. (Repeated keywords are saved multiple times, however the keywords sit next to the relevant domain.)
Whereas Example 4 would save a ton of storage space, but searching through it would take longer. (Duplicate keywords don't have to be stored, however referencing multiple keywords for each domain would take longer.)
Could anyone give me any insight or thoughts on which would be the best method to use when designing a database that can handle huge amounts of data? Bear in mind that you may want to display a URL with its associated keywords OR search for one or more keywords and bring up the most relevant URLs.
You do have a many-to-many relationship between url and keywords. The canonical way to represent this in a relational database is to use a bridge table, which corresponds to example 4 in your question.
Using the proper data structure, you will find out that the queries will be much easier to write, and as efficient as it gets.
I don't know what drives you to think that searching in a structure like the first one will be faster. It requires you to do pattern matching when searching for each single keyword, which is notably slow. On the other hand, using a junction table lets you search for exact matches, which can take advantage of indexes.
Finally, maintaining such a structure is also much easier; adding or removing keywords can be done with insert and delete statements, while other structures require you to do string manipulation in a delimited list, which again is tedious, error-prone and inefficient.
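For illustration, here is a small sketch of the exact-match search on the bridge table from Example 4, written with Python's sqlite3 as a stand-in for MySQL. The lowercase table names, the `search` helper and the index name are made up, and the surrogate ID column of the bridge table is left out in favour of a composite key. It returns the URLs ranked by how many of the requested keywords they match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls (id INTEGER PRIMARY KEY, domain TEXT NOT NULL UNIQUE);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL UNIQUE);
    CREATE TABLE url_to_keywords (
        domain_id   INTEGER NOT NULL REFERENCES urls(id),
        keywords_id INTEGER NOT NULL REFERENCES keywords(id),
        PRIMARY KEY (domain_id, keywords_id)
    );
    -- Second index so lookups that start from a keyword are index-driven too.
    CREATE INDEX idx_keyword_domain ON url_to_keywords (keywords_id, domain_id);

    INSERT INTO urls VALUES (1, 'mysite.com'), (2, 'yoursite.com'),
                            (3, 'hissite.com'), (4, 'hersite.com');
    INSERT INTO keywords VALUES (1, 'videos'), (2, 'photos'), (3, 'images'),
                                (4, 'games'), (5, 'pictures');
    INSERT INTO url_to_keywords VALUES (1,1),(1,2),(1,3),(2,1),(2,4),
                                       (3,4),(3,3),(4,2),(4,5);
""")

def search(conn, wanted):
    """Return (domain, matches) for URLs having any wanted keyword, best first."""
    if not wanted:
        return []
    placeholders = ",".join("?" * len(wanted))
    sql = f"""
        SELECT u.domain, COUNT(*) AS matches
        FROM keywords k
        JOIN url_to_keywords uk ON uk.keywords_id = k.id
        JOIN urls u ON u.id = uk.domain_id
        WHERE k.keyword IN ({placeholders})   -- exact matches, no LIKE '%...%'
        GROUP BY u.id, u.domain
        ORDER BY matches DESC, u.domain
    """
    return conn.execute(sql, list(wanted)).fetchall()

results = search(conn, ["videos", "images"])
print(results)
```

Adding or removing a keyword for a domain is then a single INSERT or DELETE on `url_to_keywords`, with no string splitting involved.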
None of the above.
Simply have a table with 2 string columns:
CREATE TABLE domain_keywords (
domain VARCHAR(..) NOT NULL,
keyword VARCHAR(..) NOT NULL,
PRIMARY KEY(domain, keyword),
INDEX(keyword, domain)
) ENGINE=InnoDB
Notes:
It will be faster.
It will be easier to write code.
Having a plain id is very much a waste.
Normalizing the domain and keyword buys little space savings, but at a big loss in efficiency.
"Huge database"? I predict that this table will be smaller than your Domains table. That is, this table is not your main concern for "huge".
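A quick sketch of how the two-column table might be used, with Python's sqlite3 standing in for MySQL. The columns are declared as TEXT because the VARCHAR lengths are not given above; the helper names and the index name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domain_keywords (
        domain  TEXT NOT NULL,
        keyword TEXT NOT NULL,
        PRIMARY KEY (domain, keyword)   -- serves "keywords for this domain"
    );
    -- Mirrors INDEX(keyword, domain): serves "domains for this keyword".
    CREATE INDEX idx_keyword_domain ON domain_keywords (keyword, domain);
""")

pairs = [
    ("mysite.com", "videos"), ("mysite.com", "photos"), ("mysite.com", "images"),
    ("yoursite.com", "videos"), ("yoursite.com", "games"),
    ("hissite.com", "games"), ("hissite.com", "images"),
    ("mysite.com", "videos"),  # duplicate pair, ignored thanks to the primary key
]
with conn:
    # SQLite spells this INSERT OR IGNORE; MySQL's equivalent is INSERT IGNORE.
    conn.executemany("INSERT OR IGNORE INTO domain_keywords VALUES (?, ?)", pairs)

def keywords_for(conn, domain):
    return [r[0] for r in conn.execute(
        "SELECT keyword FROM domain_keywords WHERE domain = ? ORDER BY keyword",
        (domain,))]

def domains_for(conn, keyword):
    return [r[0] for r in conn.execute(
        "SELECT domain FROM domain_keywords WHERE keyword = ? ORDER BY domain",
        (keyword,))]

print(keywords_for(conn, "mysite.com"))
print(domains_for(conn, "games"))
```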

finding most 'popular' items in multiple tables in mySQL [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have multiple tables of items which are 'ordered' by their ratings/popularity.
I need to combine all the tables into one table with a top 10.
The top 10 will combine the number of times an entry appears in all the lists (matched with a wildcard, as the names may be slightly different) and its position in the tables.
Is this possible?
I've researched joins, but it seems quite a complicated procedure given there are two factors (number of entries and position in the tables).
Apologies for being vague, I didn't think I was doing so. This is my first question on Stack Overflow.
table 1 | table 2 | table 3
--------|---------|--------
bob     | bob     | Ian
fred    | james   | john
kate    | fred    | bob
mary    | brian   | brian
The 'rankings' results of the three tables need to appear in a final table (called 'final', for example).
As you can see, Bob would rank highly on 'final'.
But Ian appears only once, even though he is top of the list in table 3.
Fred appears in position 2 and position 3, so should he be higher or lower than Ian?
Would I need an algorithm for the sorting, or is there some trick in MySQL that will examine the rankings?
You can return them with a ranking, but you need to define how that rank applies.
For example, if you just return the ranking from each table then Bob appears twice in the first position. If you add those 2 ranks together it gives 2. How do you compare that to Ian, who is ranked 1 only once?
For this you are probably best building a ranking from the last row (or calculating it as total number of rows - ranking).
You can get a basic ranking from each table with the following:-
-- @rank_N are MySQL user variables, each initialised to 0 by its CROSS JOIN
SELECT some_name, @rank_1:=@rank_1 + 1 AS ranking
FROM table_1
CROSS JOIN (SELECT @rank_1:=0) sub_1
UNION ALL
SELECT some_name, @rank_2:=@rank_2 + 1 AS ranking
FROM table_2
CROSS JOIN (SELECT @rank_2:=0) sub_2
UNION ALL
SELECT some_name, @rank_3:=@rank_3 + 1 AS ranking
FROM table_3
CROSS JOIN (SELECT @rank_3:=0) sub_3
but this will give you each record from each table and their ranking.
As it is though, you have nothing to say that Bob is the first record on table_1. While it may be the first record, as far as the order returned this is not a certainty.
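One possible way to turn this into a combined top 10 is sketched below, using Python's sqlite3 as a stand-in for MySQL. It rests on several assumptions that are not in the question: the three lists are loaded into one `rankings` table with an explicit `position` column (so the order is stored rather than implied), names are matched case-insensitively instead of with wildcards, and each entry scores `list size - position + 1` points, so first place in a four-row list earns 4 and last place earns 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rankings (
        list_name TEXT    NOT NULL,
        position  INTEGER NOT NULL,   -- explicit rank, 1 = top of the list
        name      TEXT    NOT NULL,
        PRIMARY KEY (list_name, position)
    )
""")

lists = {
    "table_1": ["bob", "fred", "kate", "mary"],
    "table_2": ["bob", "james", "fred", "brian"],
    "table_3": ["Ian", "john", "bob", "brian"],
}
with conn:
    for list_name, names in lists.items():
        conn.executemany(
            "INSERT INTO rankings VALUES (?, ?, ?)",
            [(list_name, pos, name) for pos, name in enumerate(names, start=1)],
        )

# Points per entry: list size - position + 1, summed over every list.
final = conn.execute("""
    SELECT LOWER(r.name) AS person,
           COUNT(*) AS appearances,
           SUM(s.list_size - r.position + 1) AS score
    FROM rankings r
    JOIN (SELECT list_name, COUNT(*) AS list_size
          FROM rankings GROUP BY list_name) s
      ON s.list_name = r.list_name
    GROUP BY LOWER(r.name)
    ORDER BY score DESC, appearances DESC, person
    LIMIT 10
""").fetchall()

for row in final:
    print(row)
```

Under this particular scoring Fred (positions 2 and 3) lands above Ian (position 1, once); a different weighting would give a different answer, which is exactly the definition you need to settle first.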

When is it better to flatten out data using comma separated values to improve search query performance?

My question is about search query performance.
I've flattened out data into a read-only Person table (MySQL) that exists purely for search. The table has about 20 columns of data (mostly limited text values, dates and booleans and a few columns containing unlimited text).
Person
=============================================================
id First Last DOB etc (20+ columns)...
1 John Doe 05/02/1969
2 Sara Jones 04/02/1982
3 Dave Moore 10/11/1984
Another two tables support the relationship between Person and Activity.
Activity
===================================
id activity
1 hiking
2 skiing
3 snowboarding
4 bird watching
5 etc...
PersonActivity
===================================
id PersonId ActivityId
1 2 1
2 2 3
3 2 10
4 2 16
5 2 34
6 2 37
7 2 38
8 etc…
Search considerations:
Person table has potentially 200-300k+ rows
Each person potentially has 50+ activities
Search may include Activity filter (e.g., select persons with one and/or more activities)
Returned results are displayed with person details and activities as bulleted list
If the Person table is used only for search, I'm wondering if I should add the activities as comma separated values to the Person table instead of joining to the Activity and PersonActivity tables:
Person
===========================================================================
id First Last DOB Activity
2 Sara Jones 04/02/1982 hiking, snowboarding, golf, etc.
Given the search considerations above, would this help or hurt search performance?
Thanks for the input.
Horrible idea. You will lose the ability to use indexes in querying. Do not under any circumstances store data in a comma-delimited list if you ever want to search on that column. Relational databases are designed to have good performance with tables joined together. Your database is relatively small and should have no performance issues at all if you index properly.
You may still want to display the results in a comma-delimited fashion. I think MySQL has a function called GROUP_CONCAT for that.
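Here is a rough sketch of both halves, filtering through the join table and building the comma-delimited list only for display. It uses Python's sqlite3 as a stand-in for MySQL; the column names `first_name`/`last_name`, the sample rows, the index name and the `people_with_all` helper are assumptions for illustration. SQLite's `group_concat(x, ', ')` corresponds to MySQL's `GROUP_CONCAT(x SEPARATOR ', ')`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE Activity (id INTEGER PRIMARY KEY, activity TEXT NOT NULL UNIQUE);
    CREATE TABLE PersonActivity (
        PersonId   INTEGER NOT NULL REFERENCES Person(id),
        ActivityId INTEGER NOT NULL REFERENCES Activity(id),
        PRIMARY KEY (PersonId, ActivityId)
    );
    -- Lets the activity filter find matching people via an index.
    CREATE INDEX idx_activity_person ON PersonActivity (ActivityId, PersonId);

    INSERT INTO Person VALUES (1, 'John', 'Doe'), (2, 'Sara', 'Jones'),
                              (3, 'Dave', 'Moore');
    INSERT INTO Activity VALUES (1, 'hiking'), (2, 'skiing'),
                                (3, 'snowboarding'), (4, 'bird watching');
    INSERT INTO PersonActivity VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 4);
""")

def people_with_all(conn, activities):
    """People who do every listed activity, plus their full activity list."""
    if not activities:
        return []
    placeholders = ",".join("?" * len(activities))
    sql = f"""
        SELECT p.id, p.first_name, p.last_name,
               (SELECT group_concat(a2.activity, ', ')
                FROM PersonActivity pa2
                JOIN Activity a2 ON a2.id = pa2.ActivityId
                WHERE pa2.PersonId = p.id) AS activities
        FROM Person p
        JOIN PersonActivity pa ON pa.PersonId = p.id
        JOIN Activity a ON a.id = pa.ActivityId
        WHERE a.activity IN ({placeholders})
        GROUP BY p.id, p.first_name, p.last_name
        HAVING COUNT(DISTINCT a.id) = ?      -- must match all of them
        ORDER BY p.id
    """
    return conn.execute(sql, [*activities, len(activities)]).fetchall()

matches = people_with_all(conn, ["hiking", "snowboarding"])
print(matches)
```

Dropping the HAVING clause turns the "all of these" filter into "any of these".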

Advice on linking product codes to a product in a MySQL database

I need some advice on how to set up my tables. I currently have a product table and a product codes table.
In the codes table I have an id and a title such as:
1 567902
2 345789
3 345678
There can be many items in this table.
In my product table I have the usual product id, title, etc., but also a code id column in which I'm currently storing a comma-separated list of ids for any codes the product needs to reference.
In that column I could end up with ids like: 2,5,6,9
I'm going to need to be able to search the products table for a specific set of code ids, and this is where I've run into problems: trying to use id IN ($var) or FIND_IN_SET is proving problematic. I've been advised to restructure it, which I'm happy to do; I'm just wondering what the best method would be.
Sounds like you have two choices. If this is a 1 to many relationship, then you need to have the foreign key in the code table, not the product table.
i.e.
codeId code productId
1 567902 2
2 345789 6
3 345678 9
4 345690 9
The other option is to have another table which contains productId and codeId (both as foreign keys), this is a many-to-many relationship. This is what you should go for if a code can be assigned to multiple products (I assume not). It will look something like this:
codeId productId
1 2
1 10
2 6
3 9
4 9
I think the first option is what you need.
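A minimal sketch of the first option, with Python's sqlite3 standing in for MySQL. The `product`/`code` table names, the product titles, the index name and the `products_for_codes` helper are placeholders. With the foreign key on the code table, the "find products for this set of code ids" search becomes a plain join with IN, and no FIND_IN_SET is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE code (
        codeId    INTEGER PRIMARY KEY,
        code      TEXT    NOT NULL,
        productId INTEGER NOT NULL REFERENCES product(id)  -- one product per code
    );
    CREATE INDEX idx_code_product ON code (productId);

    -- Titles are made-up placeholders; code rows follow the table above.
    INSERT INTO product VALUES (2, 'Product 2'), (6, 'Product 6'), (9, 'Product 9');
    INSERT INTO code VALUES (1, '567902', 2), (2, '345789', 6),
                            (3, '345678', 9), (4, '345690', 9);
""")

def products_for_codes(conn, code_ids):
    """Products referenced by at least one of the given code ids."""
    if not code_ids:
        return []  # avoid generating an empty IN () list
    placeholders = ",".join("?" * len(code_ids))
    sql = f"""
        SELECT DISTINCT p.id, p.title
        FROM product p
        JOIN code c ON c.productId = p.id
        WHERE c.codeId IN ({placeholders})
        ORDER BY p.id
    """
    return conn.execute(sql, list(code_ids)).fetchall()

found = products_for_codes(conn, [3, 4, 1])
print(found)
```

If a code can belong to several products after all, the same query works against the two-column link table from the second option by joining through it instead.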

Relational Database Design (MySQL)

I have a table User that stores user information - such as name, date of birth, locations, etc.
I have also created a link table called User_Options - for the purpose of storing multi-value attributes - this basically stores the checkbox selections.
I have a front-end form for the user to fill in and create their user profile. Here are the tables I have created to generate the checkbox options:
Table User_Attributes
=====================
id attribute_name
---------------------
1 Hobbies
2 Music
Table User_Attribute_Options
======================================
id user_attribute_id option_name
--------------------------------------
1 1 Reading
2 1 Sports
3 1 Travelling
4 2 Rock
5 2 Pop
6 2 Dance
So, on the front-end form there are two sets of checkbox options - one set for Hobbies and one set for Music.
And here are the User tables:
Table User
========================
id name age
------------------------
1 John 25
2 Mark 32
Table User_Options
==================================================
id user_id user_attribute_id value
--------------------------------------------------
1 1 1 1
2 1 1 2
3 1 2 4
4 1 2 5
5 2 1 2
6 2 2 4
(in the above table 'user_attribute_id' is the ID of the parent attribute and 'value' is the ID of the attribute option).
So I'm not sure that I've done all this correctly, or efficiently. I know there is a method of storing hierarchical data in the same table but I prefer to keep things separate.
My main concern is with the User_Options table - the idea behind this is that there only needs to be one link table that stores multi-value attributes, rather than have a table for each and every multi-value attribute.
The only thing I can see that I'd change is that in the association table, User_Options, you have an id that doesn't seem to serve a purpose. The primary key for that table would be all three columns, and I don't think you'd be referring to the options a user has by an id--you'd be getting them by user_id/user_attribute_id. For example, give me all the user options where user is 1 and user attribute id is 2. Having those records uniquely keyed with an additional field seems extraneous.
I think otherwise the general shape of the tables and their relationships looks right to me.
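As a sketch of that suggestion, here is the User_Options table keyed on all three columns instead of a separate id, using Python's sqlite3 as a stand-in for MySQL. The foreign key declarations are my assumption about how the tables are meant to relate; the data is taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE User_Attributes (id INTEGER PRIMARY KEY, attribute_name TEXT NOT NULL);
    CREATE TABLE User_Attribute_Options (
        id INTEGER PRIMARY KEY,
        user_attribute_id INTEGER NOT NULL REFERENCES User_Attributes(id),
        option_name TEXT NOT NULL
    );
    CREATE TABLE User_Options (
        user_id           INTEGER NOT NULL REFERENCES User(id),
        user_attribute_id INTEGER NOT NULL REFERENCES User_Attributes(id),
        value             INTEGER NOT NULL REFERENCES User_Attribute_Options(id),
        PRIMARY KEY (user_id, user_attribute_id, value)  -- no surrogate id
    );
    INSERT INTO User VALUES (1, 'John', 25), (2, 'Mark', 32);
    INSERT INTO User_Attributes VALUES (1, 'Hobbies'), (2, 'Music');
    INSERT INTO User_Attribute_Options VALUES
        (1, 1, 'Reading'), (2, 1, 'Sports'), (3, 1, 'Travelling'),
        (4, 2, 'Rock'), (5, 2, 'Pop'), (6, 2, 'Dance');
    INSERT INTO User_Options VALUES
        (1, 1, 1), (1, 1, 2), (1, 2, 4), (1, 2, 5), (2, 1, 2), (2, 2, 4);
""")

# "Give me all the user options where user is 1 and user attribute id is 2."
music_for_john = [r[0] for r in conn.execute("""
    SELECT o.option_name
    FROM User_Options uo
    JOIN User_Attribute_Options o ON o.id = uo.value
    WHERE uo.user_id = ? AND uo.user_attribute_id = ?
    ORDER BY o.id
""", (1, 2))]

# The composite key also stops the same checkbox being stored twice.
try:
    with conn:
        conn.execute("INSERT INTO User_Options VALUES (1, 2, 4)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(music_for_john, duplicate_rejected)
```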
There's nothing wrong with how you've done it.
It's possible to make things more extensible at the price of more linked table references (and in the composition of your queries). It's also possible to make things flatter, and less extensible and flexible, but your queries will be faster.
But, as is usually the case, there's more than one way to do it.