MYSQL DB Best method to store keywords and URL index - mysql

Which of these methods would be the most efficient way of storing, retrieving, processing and searching a large (millions of records) index of stored URLs along with there keywords.
Example 1: (Using one table)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com videos,photos,images
2 yoursite.com videos,games
3 hissite.com games,images
4 hersite.com photos,pictures
---------------------------------------------------------
Example 2: (one-to-one Relationship from one table to another)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_KEYWORDS---------------------------------------------
ID DOMAIN_ID KEYWORDS
1 1 videos,photos,images
2 2 videos,games
3 3 games,images
4 4 photos,pictures
---------------------------------------------------------
Example 3: (one-to-one Relationship from one table to another (Using a reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 2 2
3 3 3
4 4 4
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos,photos,images
2 videos,games
3 games,images
4 photos,pictures
---------------------------------------------------------
Example 4: (many-to-many Relationship from url to keyword ID (using reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 1 2
3 1 3
4 2 1
5 2 4
6 3 4
7 3 3
8 4 2
9 4 5
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos
2 photos
3 images
4 games
5 pictures
---------------------------------------------------------
My understanding is that Example 1 would take the largest amount of storage space however searching through this data would be quick (Repeat keywords saved multiple times, however keywords are sat next to the relevant domain)
wWhereas Example 4 would save a tons on storage space but searching through would take longer. (Not having to store duplicate keywords, however referencing multiple keywords for each domain would take longer)
Could anyone give me any insight or thoughts on which the best method would be to utilise when designing a database that can handle huge amounts of data? With the foresight that you may want to display a URL with its assosicated keywords OR search for one or more keywords and bring up the most relevant URLs

You do have a many-to-many relationship between url and keywords. The canonical way to represent this in a relational database is to use a bridge table, which corresponds to example 4 in your question.
Using the proper data structure, you will find out that the queries will be much easier to write, and as efficient as it gets.
I don't know what drives you to think that searchin in a structure like the first one will be faster. This requires you to do pattern matching when searching for each single keyword, which is notably slow. On the other hand, using a junction table lets you search for exact matches, which can take advantage of indexes.
Finally, maintaining such a structure is also much easier; adding or removing keywords can be done with insert and delete statements, while other structures require you do do string manipulation in delimited list, which again is tedious, error-prone and inefficient.

None of the above.
Simply have a table with 2 string columns:
CREATE TABLE domain_keywords (
domain VARCHAR(..) NOT NULL,
keyword VARCHAR(..) NOT NULL,
PRIMARY KEY(domain, keyword),
INDEX(keyword, domain)
) ENGINE=InnoDB
Notes:
It will be faster.
It will be easier to write code.
Having a plain id is very much a waste.
Normalizing the domain and keyword buys little space savings, but at a big loss in efficiency.
"Huse database"? I predict that this table will be smaller than your Domains table. That is, this table is not your main concern for "huge".

Related

DB design little or too much data

I'm currently working on a little project that uses MySQL. However I'm struggling with the database design. Currently I've come up with 2 designs, one stores more data but is actually the way I want it to be, however this way makes it really hard to work with the data. The other way is I think more basic and simplifies a lot of things but stores less data.
Design 1
Example data items table
id
description
time_created
1
Car
2021-04-17 17:30:00
2
Bike
2021-04-17 17:30:00
Example data user_items table
id
user_id
item_id
time_achieved
1
1
1
2021-04-17 17:30:04
2
1
1
2021-04-17 17:30:03
3
1
1
2021-04-17 17:30:17
4
1
1
2021-04-17 17:30:22
5
1
1
2021-04-17 17:30:34
6
1
2
2021-04-17 17:30:42
7
1
2
2021-04-17 17:30:54
Design 2
Example data items table
id
description
time_created
1
Car
2021-04-17 17:30:00
2
Bike
2021-04-17 17:30:00
Example data user_items table
id
user_id
item_id
count
1
1
1
5
2
1
2
2
Basically we have items that can be anything, they include a description to specify what they actually are. A user can collect items (a lot). These are stored in the user_items table which contains a FK user_id and item_id to the users and items table. The users table is left out for simplicity.
As you can see design 1 stores a lot more rows for the user_items table, this allows us to add more information (time_achieved and more) per item that a user achieved. However this results in more rows and probably a harder time queriyng. Design 2 on the other hand simply adds a count column to determine how many items the user has, but this is very limiting because we cannot add more data (achieved time..) per user_item.
I'm not sure if design 1 is the right and only design for what we want to achieve. Basically we really want to store additional metadata per user_item but I just don't know if this is the right design since it quickly fills up the database. Does anyone have a suggestion/idea for an alternative design which stores less data than design 1 but still allows to add more info per user_item?
Thanks in advance.
Does anyone have a suggestion/idea for an alternative design which stores less data than design 1 but still allows to add more info per user_item?
Design 1 should work.
This design will also work but quickly fills up, more efficient.
id, item_id,Item_des,Item_qty,user_id,username,time_created all in one table.
some of the values will be repeated.

How to design MySQL Table Structure for search optimization [duplicate]

This question already has answers here:
Is storing a delimited list in a database column really that bad?
(10 answers)
Closed 2 years ago.
Hello to Folks out there,
I need some suggestion on How to design MySQL Table Structure for search optimization.
I am creating a Real estate website. In that I have property table and all its associated tables.
I can design my table for saving these records in two ways.
I have amenity master table
id property_name
----------------
1 Property A
2 Property B
3 Property C
Approach 1
Property Table
id property_name
----------------
1 Property A
2 Property B
3 Property C
Property_Amenities table
id p_id amenity_id
------------------------
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 1
Approach 2
Property Table
id property_name amenity_id
----------------------------
1 Property A 1,2,3,4,5
2 Property B 1,4,7,9,12
3 Property C 3,4,7,8,9,10
Approach 1 Query : I can join tables and get the all the amenities name for a particular property. Add the required index for optimization. For A property there will be 20 amenities on averages. Suppose I have 100K property records then to get amenities for particular property, MySQL query will search in 2 millions records of Property_Amenities table.
Approach 2 Query : I can search using FIND_IN_SET or IN MySQL operator. But I was going through this search topic and it seems that this approach will be much resource intensive and cost more for same amount of data i.e 100k property records.
Any suggestion will be appropriated. What your thought on this scenario or any other approach I should follow.
Use the first approach. Period. The second is wrong, wrong, wrong. Here are some reasons:
A SQL string should not be used to store multiple values.
Integers should be stored using the proper type.
Foreign key relationships should be declared.
SQL has very poor string handling methods.
SQL has a great way to store lists; it is called a "table" not a "string".

When is it better to flatten out data using comma separated values to improve search query performance?

My question about SEARCH query performance.
I've flattened out data into a read-only Person table (MySQL) that exists purely for search. The table has about 20 columns of data (mostly limited text values, dates and booleans and a few columns containing unlimited text).
Person
=============================================================
id First Last DOB etc (20+ columns)...
1 John Doe 05/02/1969
2 Sara Jones 04/02/1982
3 Dave Moore 10/11/1984
Another two tables support the relationship between Person and Activity.
Activity
===================================
id activity
1 hiking
2 skiing
3 snowboarding
4 bird watching
5 etc...
PersonActivity
===================================
id PersonId ActivityId
1 2 1
2 2 3
3 2 10
4 2 16
5 2 34
6 2 37
7 2 38
8 etc…
Search considerations:
Person table has potentially 200-300k+ rows
Each person potentially has 50+ activities
Search may include Activity filter (e.g., select persons with one and/or more activities)
Returned results are displayed with person details and activities as bulleted list
If the Person table is used only for search, I'm wondering if I should add the activities as comma separated values to the Person table instead of joining to the Activity and PersonActivity tables:
Person
===========================================================================
id First Last DOB Activity
2 Sara Jones 04/02/1982 hiking, snowboarding, golf, etc.
Given the search considerations above, would this help or hurt search performance?
Thanks for the input.
Horrible idea. You will lose the ability to use indexes in querying. Do not under any circumstances store data in a comma delimited list if you ever want to search on that column. Realtional database are designed to have good performance with tables joined together. Your database is relatively small and should have no performance issues at all if you index properly.
You may still want to display the results in a comma delimted fashion. I think MYSQL has a function called GROUP_CONCAT for that.

Storing data in a link table

Supoose I have the following:
tbl_options
===========
id name
1 experience
2 languages
3 hourly_rate
tbl_option_attributes
=====================
id option_id name value
1 1 beginner 1
2 1 advanced 2
3 2 english 1
4 2 french 2
5 2 spanish 3
6 3 £10 p/h 10
7 3 £20 p/h 20
tbl_user_options
================
user_id option_id value
1 1 2
1 2 1
1 2 2
1 2 3
1 3 20
In the above example tbl_user_options stores option data for the user. We can store multiple entries for some options.
Now I wish to extend this, i.e. for "languages" I want the user to be able to specify their proficiency in a language (basic/intermediate/advanced). There will also be other fields that will have extended attributes.
So my question is, can these extended attributes be stored in the same table (tbl_user_options) or do I need to create more tables? Obviously if I put in a field "language_proficiency" it won't apply to the other fields. But this way I only have one user options table to manage. What do you think?
EDIT: This is what I propose
tbl_user_options
================
user_id option_id value lang_prof
1 1 2 null
1 2 1 2
1 2 2 3
1 2 3 3
1 3 20 null
My gut instinct would be to split the User/Language/Proficiency relationship out into its own tables. Even if you kept it in the same table with your other options, you'd need to write special code to handle the language case, so you might as well use a new table structure.
Unless your data model is in constant flux, I would rather have tbl_languages and tabl_user_languages tables to store those types of data:
tbl_languages
================
lang_id name
1 English
2 French
3 Spanish
tbl_user_languages
================
user_id lang_id proficiency hourly_rate
1 1 1 20
1 2 2 10
2 2 1 15
2 2 3 20
3 3 2 10
Designing a system that is "too generic" is a Turing tarpit trap for a relational SQL database. A document-based database is better suited to arbitrary key-value stores.
Excepting certain optimisations, your database model should match your domain model as closely as possible to minimise the object-relational impedance mismatch.
This design lets you display a sensible table of user language proficiencies and hourly rates with only two inner joins:
SELECT
ul.user_id,
u.name,
l.name,
ul.proficiency,
ul.hourly_rate
FROM tbl_user_languages ul
INNER JOIN tbl_languages l
ON l.lang_id = ul.lang_id
INNER JOIN tbl_users u
ON u.user_id = ul.user_id
ORDER BY
l.name, u.hour
Optionally you can split out a list of language proficiencies into a tbl_profiencies table, where 1 == Beginner, 2 == Advanced, 3 == Expert and join it onto tbl_user_languages.
i'm thinking it's a mistake to put "languages" as an option. while reading your text it seems to me that english is an option, and it might have an attribute from option_attributes.

Relational Database Design (MySQL)

I have a table User that stores user information - such as name, date of birth, locations, etc.
I have also created a link table called User_Options - for the purpose of storing multi-value attributes - this basically stores the checkbox selections.
I have a front-end form for the user to fill in and create their user profile. Here are the tables I have created to generate the checkbox options:
Table User_Attributes
=====================
id attribute_name
---------------------
1 Hobbies
2 Music
Table User_Attribute_Options
======================================
id user_attribute_id option_name
--------------------------------------
1 1 Reading
2 1 Sports
3 1 Travelling
4 2 Rock
5 2 Pop
6 2 Dance
So, on the front-end form there are two sets of checkbox options - one set for Hobbies and one set for Music.
And here are the User tables:
Table User
========================
id name age
------------------------
1 John 25
2 Mark 32
Table User_Options
==================================================
id user_id user_attribute_id value
--------------------------------------------------
1 1 1 1
2 1 1 2
3 1 2 4
4 1 2 5
5 2 1 2
6 2 2 4
(in the above table 'user_attribute_id' is the ID of the parent attribute and 'value' is the ID of the attribute option).
So I'm not sure that I've done all this correctly, or efficiently. I know there is a method of storing hierarchical data in the same table but I prefer to keep things separate.
My main concern is with the User_Options table - the idea behind this is that there only needs to be one link table that stores multi-value attributes, rather than have a table for each and every multi-value attribute.
The only thing I can see that I'd change is that in the association table, User_Options, you have an id that doesn't seem to serve a purpose. The primary key for that table would be all three columns, and I don't think you'd be referring to the options a user has by an id--you'd be getting them by user_id/user_attribute_id. For example, give me all the user options where user is 1 and user attribute id is 2. Having those records uniquely keyed with an additional field seems extraneous.
I think otherwise the general shape of the tables and their relationships looks right to me.
There's nothing wrong with how you've done it.
It's possible to make things more extensible at the price of more linked table references (and in the composition of your queries). It's also possible to make things flatter, and less extensible and flexible, but your queries will be faster.
But, as is usually the case, there's more than one way to do it.